International Journal of Electrical and Computer Engineering (IJECE)
Vol. 6, No. 3, June 2016, pp. 945–954
ISSN: 2088-8708, DOI: 10.11591/ijece.v6i3.9030
Fast Document Summarization using Locality Sensitive Hashing and Memory Access Efficient Node Ranking

Ercan Canhasi*
* Faculty of Computer Science, University of Prizren
Article Info

Article history:
Received Feb 5, 2016
Revised May 7, 2016
Accepted May 19, 2016

Keywords:
Document summarization
Locality Sensitive Hashing
I/O Access Efficient Node Ranking
Min-Hashing
Timeline summarization
Comparative summarization
ABSTRACT

Text modeling and sentence selection are the fundamental steps of a typical extractive document summarization algorithm. The common text modeling method connects a pair of sentences based on their similarity. Even though it can effectively represent the sentence similarity graph of the given document(s), its big drawback is a large time complexity of O(n^2), where n represents the number of sentences. The quadratic time complexity makes it impractical for large documents. In this paper we propose fast approximation algorithms for text modeling and sentence selection. Our text modeling algorithm reduces the time complexity to near-linear time by rapidly finding the most similar sentences to form the sentence similarity graph. In doing so we utilize Locality-Sensitive Hashing, a fast algorithm for approximate nearest neighbor search. For the sentence selection step we propose a simple memory-access-efficient node ranking method based on the idea of sequentially scanning only the neighborhood arrays. Experimentally, we show that sacrificing a rather small percentage of recall and precision in the quality of the produced summary can reduce the quadratic time complexity to sub-linear. We see big potential for the proposed method in text summarization for mobile devices and in big text data summarization for the Internet of Things on the cloud. In our experiments, besides evaluating the presented method on the standard general and query multi-document summarization tasks, we also tested it on a few alternative summarization tasks, including general and query, timeline, and comparative summarization.
Copyright © 2016 Institute of Advanced Engineering and Science. All rights reserved.
Corresponding Author:
Ercan Canhasi
Faculty of Computer Science, University of Prizren
Rruga e Shkronjave, nr. 1, 20000 Prizren, Kosove
ercan.canhasi@uni-prizren.com
1. INTRODUCTION

The world is floating on data. These data mainly come from the world wide web, which is expanding exponentially and making a massive volume of online information available to users. For years we have been affected by the quantity of data streaming through and produced by our systems. Automatic document summarization, as a complementary tool to regular web search engines, can be used to scale down this problem of information overload. Since most mobile and interactive ubiquitous multimedia devices have restricted hardware, such as CPU, memory, and display screen, it is essential to compress a document collection into a brief summary before it is delivered to the user of these devices. Other technology trends which can largely benefit from scalable document summarization methods are the Internet of Things (IoT) on the cloud and big data. The latter is about the processing and analysis of large data repositories on cloud computing. Big document summarization is an important technique for data management of the IoT: traditional document summarization methods are too restricted to summarize suitable information from the exploding IoT big data on the cloud.

The main task of extraction-based multi-document summarization is to extract the most important sentences from multiple documents and format them into a summary. According to the number of documents to be summarized, summarization can be single-document or multi-document. Query-focused multi-document summarization is a special case of multi-document summarization: given a query, the task is to produce a summary which can respond to the information required by the query. Two other specific document summarization tasks treated in this work are timeline and comparative summarization.
Table 1. Observed execution time spent on calculations (in seconds). The time elapsed in computing the summaries was measured on a processor with the following specifications: Intel(R) Core(TM) i5 CPU M 450 @ 2.45GHz with 4GB of RAM. The first two columns, the document(s) length in KB and the total number of sentences, represent the input values. The remaining four columns, text modeling by means of LSH and by the conventional all-to-all comparison method, and sentence selection with I/O access efficient node ranking and with archetypal analysis (AA) based sentence selection, are the measured times, i.e., the output values of the experiment.

Doc.(s) length | # of sentences | Text modeling: LSH | Text modeling: all to all | Sentence selection: node ranking | Sentence selection: AA
1KB   | 13   | 0.008 | 0.09   | 0.021 | 1.73
20KB  | 160  | 0.054 | 1.40   | 0.056 | 1.11
45KB  | 366  | 0.138 | 2.28   | 0.139 | 4.82
100KB | 744  | 0.303 | 5.89   | 0.308 | 9.21
644KB | 5112 | 4.110 | 191.06 | 4.123 | 207.89
Timeline summarization aims at producing a sequence of compact summaries for news sets broadcast at various periods. Comparative news summarization aims to outline the mutualities and contrasts between comparable news subjects.
In this paper, we propose a scalable solution to multi-document summarization based on randomized algorithms. Many sentence similarity graph generation algorithms make use of some distance similarity (e.g., cosine similarity) to measure the pairwise distance between the sets of vectors representing the corresponding sentences. Assume that we are given n sentences with a maximum of m terms. Calculating the full similarity matrix would take time complexity O(n^2 m). Many novel summarization tasks, such as comparative, update and timeline summarization, require processing a large number of sentences. Having an O(n^2 m) algorithm in such setups would be very impractical. Fortunately, we can borrow some ideas from mathematics and theoretical computer science to develop a scalable document summarization algorithm proportional to nm. The essence of our method lies in defining Locality Sensitive Hash (LSH) functions. LSH functions involve the creation of short signatures (fingerprints) for each vector in space, such that those vectors that are closer to each other are more likely to have similar fingerprints. LSH functions are generally based on randomized algorithms and are probabilistic.
The contribution of this paper is fourfold: 1) the paper presents a new fast sentence selection algorithm able to scale effortlessly; 2) we describe a method for sub-linear time text modeling by means of a sentence similarity graph and very efficient node ranking in those graphs; 3) no individual part of our method is new or revolutionary: locality sensitive hashing has been done before, as have node ranking and its usage in summarization; the novelty is in the combination of these individually useful parts into a single, coherent, real-time summarization system, and we have not seen LSH nor our node ranking implementation applied to summarization in this way before; 4) we extensively evaluated our method on a few different summarization tasks.
The remainder of the paper is organized as follows. Section 2 briefly presents the related work. In Section 3 we describe the centerpiece of this work, namely the fast document summarization method. Section 4 gives the description of the test environment and the data sets we used for testing, and presents the results. We conclude the paper and set guidelines for further work in Section 5.
2. RELATED WORK

Our work is related to various research fields, including general and query focused summarization [1, 2, 3, 4, 5, 6, 7], timeline summarization [8, 9], comparative summarization [10], sampling-based document summarization algorithms [11], node ranking of the sentence similarity graph [12, 13, 14], and similarity search for high dimensional data objects [15, 16]. The following paragraphs give a brief survey of these works.

In [11], the authors use Random Indexing for text modeling. Random indexing presents a computationally efficient way of implicit dimensionality reduction. It involves inexpensive vector computations such as addition and thus provides an efficient way to compute similarities between words, sentences and documents. The algorithm that we use in this paper, min-hash [17], was originally developed for returning only the authoritative documents in search results. Another closely related problem is the one known as text reuse [18]. In contrast to near-duplicate detection, the focus there is usually on smaller segments of text as opposed to entire documents. Other similar formulations of the problem are what the data mining community calls pairwise similarity search or all pairs search [19] and what the database community calls set similarity join [20].
Our previous work [3, 4, 5] presents an extractive summarization framework based on three alternative models that integrate archetypal analysis based sentence selection: (1) plain archetypal analysis sentence clustering and ranking for general summarization; (2) weighted archetypal analysis sentence selection for query focused document summarization; and (3) weighted hierarchical archetypal analysis sentence selection for four different summarization tasks.
Timeline summarization (TS for short) has become a widely adopted, natural way to present long news stories in a compact manner. Existing approaches for TS aim to generate a good daily summary for each of the covered dates (e.g., [8, 9]). In this study, we focus on showing how the presented method can be used directly, without extra effort, on the TS problem.
Comparative multi-document summarization (CDS) was first proposed in [10] to summarize differences between comparable document groups. [10] presents a sentence selection strategy modeled by means of conditional entropy, which precisely discriminates the documents in different groups.
Graph-based methods like TextRank [12] and LexRank [13] model a document or a set of documents as a text similarity graph, constructed by taking sentences as vertices and the similarity between sentences as edge weights. They take into account the global information and recursively calculate the sentence significance from the entire text graph, rather than simply relying on unconnected individual sentences.
From an NLP perspective, extractive summarization embodies two criteria: sentence relevance and sentence redundancy. Graph-based sentence ranking algorithms successfully merge both of these criteria into a single framework by utilizing the so-called graph-based lexical centrality principle. Graph-based ranking algorithms were also used in query-focused summarization when it became a popular research topic. For instance, a topic-sensitive version of LexRank is proposed by [14]. It integrates the relevance of a sentence to the query into LexRank to get a biased PageRank ranking.
A work similar to ours, [21], presents a new principled and versatile summarization framework for MDS using submodular functions. This framework can deal with different summarization tasks, including generic, query-focused, update, and comparative summarization. The empirical results show that this framework outperforms the other rivals in generic summarization and is competitive in the other summarization tasks.
In [22] the authors have investigated the use of maximum entropy, naive-Bayes, and support vector machine models, and a hybrid machine model, for multi-document automatic text summarization.
3. FAST DOCUMENT SUMMARIZATION

3.1. Motivation

The trend in automatic document summarization approaches found in state-of-the-art systems is a general summarization method which consists of the following sub-tasks:
1. Text modeling: convert the text into, for instance, a graph representation;
2. Sentence ranking: identify the salient sentences from the given text model;
3. Summary generation: extract the selected sentences into the final summary.
To obtain the sentence similarity graph, one needs to compute the similarity values for all possible pairs of sentences in order to connect them in the graph. Mainly the vector space model is used to represent the sentences of the given documents. The vector space model is an algebraic model for representing sentences as vectors of terms. The cosine similarity is a measure of similarity between two vectors of an inner product space that measures the cosine of the angle between them. Assuming that multiplication and addition are constant-time operations, the time complexity of computing the cosine similarity, where m is the biggest number of terms, is O(m) + O(m) = O(m). But since we need to compute the sentence similarity for every pair of sentences, the time and space complexity of generating the sentence similarity graph becomes O(n(n − 1)/2), where n is the number of sentences.
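For concreteness, the following minimal Python sketch (an illustration of the conventional approach described above, not the paper's implementation; all names are ours) builds the similarity graph with exactly this quadratic all-pairs loop:

import math
from collections import Counter

def cosine_similarity(a, b):
    # O(m) dot product over the shared terms, plus O(m) norm computations
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def all_pairs_graph(sentences, threshold=0.1):
    # Conventional text modeling: O(n(n-1)/2) similarity computations
    vectors = [Counter(s.lower().split()) for s in sentences]
    edges = []
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            sim = cosine_similarity(vectors[i], vectors[j])
            if sim > threshold:
                edges.append((i, j, sim))
    return edges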
In order to give a better gist of the time complexity, we report the elapsed time in producing the summary for different document(s) lengths in Table 1. Not only is the similarity graph calculation time-expensive, but the usual sentence selection methods are also very time consuming. Hence, this paper presents a way of using fast randomized approximation algorithms (i.e., LSH and min-hash) to deal with the quadratic complexity of the conventional text modeling techniques. Our approximation algorithm utilizes Locality-Sensitive Hashing, abbreviated as LSH hereafter, which is a probabilistic approximation algorithm for nearest neighbor search. We do not only try to increase the speed and scalability of the summary production system on the text modeling level, but also present our contribution on the sentence selection/extraction level.
For text modeling we propose the following method (essential steps in similarity graph computation):
1. N-gram extraction: convert sentences into sets;
2. Min-Hashing: convert wide sets to short signatures, while preserving similarity;
3. Locality-Sensitive Hashing: concentrate on pairs of signatures likely to originate from similar sentences.
Figure 1. Method overview. Documents are turned into sets of strings of length n (n-gram extraction), then into signatures, i.e., short integer vectors representing the sets (min-hashing), and locality sensitive hashing yields candidate pairs for the similarity graph. The LSH-based sentence similarity graph is stored as adjacency lists (Off[]: offsets for node i; Nb[]: pointers to out-neighbors), on which I/O access efficient node ranking produces a node-ranked list of sentences for sentence selection and ordering into the final summary.
Given the sets of similar sentences, we can very efficiently compute the sentence ranking by mapping the problem of sentence selection to node ranking in the graph of similar sentence sets. For sentence selection we propose the following method (essential steps in I/O efficient node ranking):
1. Get the input sentence similarity graph represented as a set of three arrays;
2. Produce the sentence ranking by recursively computing the eigenvector decomposition;
3. Return the sentence ranking.
In the rest of the section we describe these steps in more detail.
3.2. Fast similarity graph computing

In this subsection, the problem of sentence similarity is first described as a search for sets with an approximately large intersection. We then show how the problem of finding textually similar sentences can be turned into such a set problem by representing the given text entities as sets of n-grams. Then, we present how the method known as min-hashing can be used for shortening these huge n-gram sets while preserving the similarity information of the underlying sets. Finally, we utilize locality-sensitive hashing to focus our search on the pairs of sentences that are most probable to be similar.
Let us assume that the similarity of a pair of sentences can be deduced by merely looking at the relative size of their intersection. This is the similarity measure known as Jaccard similarity. If the Jaccard similarity of sets W and Z is |W ∩ Z| / |W ∪ Z|, then the Jaccard similarity of sentences S1 and S2 can be denoted as SIM(S1, S2). The form of similarity we are utilizing here is character-level similarity.
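A direct rendering of this measure in Python (illustrative only; the sentences are toy examples of ours):

def jaccard(w, z):
    # |W ∩ Z| / |W ∪ Z|; 0 for two empty sets by convention
    return len(w & z) / len(w | z) if (w or z) else 0.0

s1 = set("the cat sat on the mat".split())
s2 = set("the cat sat on a mat".split())
print(jaccard(s1, s2))  # 5 shared words, 6 distinct overall -> ~0.833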
A very simple but productive method for representing sentences as sets is to describe them as the sets of very short strings that occur within the sentences. In this way, sentences that share pieces as short as words or even syllables will have many common elements in their sets, even if those common entities appear in different orders in the two sentences.
Define an n-gram for a sentence to be any substring of length n found within the sentence. Then, associate each sentence with the set of n-grams that appear one or more times within that sentence. Instead of manipulating the substrings as n-grams, we choose a hash function that maps them to some number of buckets and treat the resulting bucket number as the n-gram. The set defining a sentence eventually becomes a set of integers that are the bucket numbers of one or more n-grams appearing in the sentence. In this way we drastically compress the original textual data.
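A minimal sketch of this shingling-and-hashing step, assuming CRC32 as the bucket hash (the paper does not name a specific hash function, so this choice is ours):

import zlib

def ngram_buckets(sentence, n=4, num_buckets=2**32):
    # Character n-grams of the sentence, each hashed to a bucket number;
    # the set of integers replaces the raw substrings.
    s = sentence.lower()
    grams = {s[i:i + n] for i in range(max(len(s) - n + 1, 0))}
    return {zlib.crc32(g.encode("utf-8")) % num_buckets for g in grams}

print(ngram_buckets("Document summarization", n=4))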
Algorithm 1 Fast min-hashing algorithm
1: procedure MINHASHING(S[], H[], n, k)
2: Input: S (characteristic vector of the set of n-grams), H (k random hash functions), n (number of n-grams), k (number of hash functions);
3: Output: c, the min-hash signature of the input set of n-grams S;
4: c[] ← new int[k]
5: for j = 0 to k − 1 do
6:     c[j] ← ∞
7: for i = 0 to n − 1 do
8:     if S(i) == 1 then
9:         for j = 0 to k − 1 do
10:            if h_j(i) < c[j] then
11:                c[j] ← h_j(i)
12: end procedure
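A Python sketch of the same idea over the hashed n-gram sets, assuming a random linear hash family h(x) = (ax + b) mod p (the specific family is our choice, not the paper's):

import random

PRIME = 2_147_483_647  # a large prime for the universal hash family

def make_hash_functions(k, seed=42):
    # k random functions h(x) = (a*x + b) mod PRIME
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(k)]

def minhash_signature(bucket_set, hash_funcs):
    # For every hash function keep the minimum hash value over the set
    # (assumes a non-empty bucket_set).
    return [min((a * x + b) % PRIME for x in bucket_set) for a, b in hash_funcs]

The fraction of signature positions on which two sentences agree then estimates SIM(S1, S2).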
Algorithm 2 Approximate nearest neighbor search
1: procedure LSH(M[,], s, b, r)
2: Input: M (min-hash signature matrix), s (similarity threshold), b (number of bands), r (number of rows per band);
3: Output: F, the set of sentence pairs with Jaccard similarity at least s;
4: Divide matrix M into b bands of r rows
5: for each band do: hash its portion of each column to a hash table with k buckets; make k as large as possible
6: end for
7: Candidate column pairs are those that hash to the same bucket for at least one band
8: Tune b and r to catch most similar sentences
9: end procedure
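A compact Python sketch of the banding technique of Algorithm 2 (the tuple key stands in for hashing each band portion into buckets; the names are ours):

from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(signatures, b, r):
    # Split each signature (length b*r) into b bands of r rows; columns whose
    # band slices collide in at least one band become candidate pairs.
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for col, sig in enumerate(signatures):
            key = tuple(sig[band * r:(band + 1) * r])  # stands in for a bucket hash
            buckets[key].append(col)
        for cols in buckets.values():
            candidates.update(combinations(cols, 2))
    return candidates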
Since the sets of n-grams are usually large, one can replace them with much smaller representations called signatures. Signatures can be calculated using the method known as min-hashing, briefly given in Algorithm 1. This technique is designed to guarantee that two similar objects generate hashes that are themselves similar. In fact, the similarity of the hashes has a direct relationship to the similarity of the sentences they were generated from: the fraction of agreeing signature components tends to approximate the Jaccard similarity.
Although we use min-hashing to compress large sets into small signatures while preserving the expected similarity of any pair of sentences, there is still another very important issue: efficiently finding the pairs of sentences with the greatest similarity can be very time consuming, because the number of pairs of sentences may be too large. The brute-force approach would be to compare each sentence with each other sentence using MinHash, which obviously has quadratic time complexity.
A faster solution is to use Locality Sensitive Hashing (LSH). This takes the MinHash values for the sentences and hashes them such that similar sentences hash into the same bucket. The brief algorithm is described in Algorithm 2. Note that the computation time for LSH with MinHash depends only on the number of sentences and the number of MinHash functions used, and not on the length of the sentences.
We can now give an approach to finding the set of candidate pairs of similar sentences, and then discovering the truly similar sentences among them:
1. Pick a value of n and construct from each sentence the set of its n-grams.
2. Hash the n-grams to shorter bucket numbers.
3. Sort the sentence/n-gram pairs to order them by the latter.
4. Pick a length n for the minhash signatures. Feed the sorted list to Algorithm 1 to compute the minhash signatures for all the sentences.
5. Choose a threshold t that defines how similar sentences have to be in order to be regarded as a desired "similar pair". Pick a number of bands b and a number of rows r such that br = n, and the threshold t is approximately (1/b)^(1/r).
6. Construct candidate pairs by applying the LSH technique described in Algorithm 2.
7. Connect the sentences in the similarity graph based on the candidate pairs calculated by LSH; a sketch of the assembled pipeline follows.
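Wired together, steps 1-7 might look like the following sketch, reusing the helper functions from the earlier sketches and taking b, r and the n-gram length from the General/Query row of Table 2 (the toy input is ours):

b, r = 20, 2                       # bands and rows, signature length n = b*r = 40
t = (1.0 / b) ** (1.0 / r)         # threshold ~ (1/b)^(1/r), here about 0.22

sentences = ["first sentence ...", "second sentence ..."]   # input document(s)
sets = [ngram_buckets(s, n=4) for s in sentences]            # steps 1-2
hfs = make_hash_functions(k=b * r)
sigs = [minhash_signature(s, hfs) for s in sets]             # steps 3-4

edges = []
for i, j in lsh_candidate_pairs(sigs, b, r):                 # steps 5-6
    sim = jaccard(sets[i], sets[j])    # verify candidates on the original sets
    if sim >= t:
        edges.append((i, j, sim))      # step 7: an edge in the similarity graph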
3.3. Sentence selection

In this subsection we describe an I/O efficient graph-based ranking method for sentence selection from the graph of sentences, whose construction methodology was presented in the previous subsection. The idea has been vastly used in document summarization since the pioneering works known as PageRank and TextRank [12].
To efficiently compute the PageRank scores for big graphs, the input sentence similarity graph has to be represented as a binary link structure, more specifically as a set of three arrays: SenL (list of the n sentences), Off (array of integers which denotes the offsets of the list for node i), and Nb (array of integers which denotes the pointers to out-neighbors). Using the above structure, a simple I/O efficient PageRank algorithm can be written as in Algorithm 3. Note that except for the new_pr[] array, which represents the PageRank values, all arrays are scanned only once, sequentially from front to end.
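A small sketch of how such a structure could be packed from the edge list (illustrative; the arrays mirror the Off/Nb description above, and the function name is ours):

def build_link_structure(edges, n):
    # Pack the graph into Off/Nb: Off[i]..Off[i+1] delimits node i's
    # slice of Nb, which stores the pointers to its out-neighbors.
    neighbors = [[] for _ in range(n)]
    for i, j, _ in edges:          # an undirected edge yields two arcs
        neighbors[i].append(j)
        neighbors[j].append(i)
    off, nb = [0], []
    for lst in neighbors:
        nb.extend(lst)
        off.append(len(nb))
    return off, nb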
Given the node ranking, our summarization approach extracts the most important nodes, i.e. sentences, to include in the summary. Here an additional sentence penalization step is applied. Suppose x_i is the highest ranked sentence. Sentence x_i is moved to the set of sentences representing the final summary, and then a redundancy penalty is imposed on the overall rank score of each sentence linked with x_i as follows: for each sentence x_j, its rank score RScore(x_j) is updated by RScore(x_j) = RScore(x_j) · (1 − Sim_ji)^t, where t > 0 is the exponent decay factor. The larger t is, the greater the penalty imposed on the overall rank score. If t = 0, no diversity penalty is imposed at all. In our experiments we set t = 3. The presented penalization algorithm is based on the idea that the overall rank score of less informative sentences overlapping with the sentences already in the summary is iteratively decreased. Here, redundancy removal is also the key step of content selection. Finally, the sentence with the highest rank score is repeatedly chosen for the summary until the summary length limit is satisfied.
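A greedy rendering of this selection-with-penalty loop (illustrative; for simplicity the limit here counts sentences, whereas the paper's tasks use a word limit):

def select_summary(rank_scores, sim, limit, t=3):
    # After taking the top sentence i, every remaining sentence j loses
    # score by the factor (1 - sim[i][j])**t (t = 3 in the experiments).
    scores = list(rank_scores)
    chosen = []
    while len(chosen) < limit and any(s > float("-inf") for s in scores):
        i = max(range(len(scores)), key=scores.__getitem__)
        chosen.append(i)
        scores[i] = float("-inf")               # never re-select it
        for j in range(len(scores)):
            if scores[j] > float("-inf"):
                scores[j] *= (1 - sim[i][j]) ** t
    return chosen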
Algorithm 3 I/O efficient node ranking
1: procedure SENTENCERANKING(SenL, Off, Nb)
2: Input: SenL (list of the n sentences), Off (array of integers which denotes the offsets of the list for node i), Nb (array of integers which denotes the pointers to out-neighbors);
3: Output: node-ranked list of sentences;
4: n ← SenL.Count
5: β ← 0.15
6: m ← 10
7: pr[] ← new float[n]
8: new_pr[] ← new float[n]
9: for i = 0 to n − 1 do
10:    pr[i] ← 1/n
11:    new_pr[i] ← (1 − β)/n
12: for k = 1 to m do
13:    for i = 0 to Off.Count − 1 do
14:        outd ← Off[i + 1] − Off[i]
15:        for j = Off[i] to Off[i + 1] − 1 do
16:            new_pr[Nb[j]] += β · (pr[i]/outd)
17:    swap pr[] and new_pr[]; reset new_pr[i] ← (1 − β)/n for all i
18: end procedure
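The same procedure in Python (a sketch, keeping the β and iteration count shown in Algorithm 3; as noted above, new_pr is the only array that is not scanned strictly sequentially):

def sentence_ranking(off, nb, beta=0.15, iters=10):
    # Off and Nb are read once per iteration, front to end;
    # new_pr is the only randomly accessed array.
    n = len(off) - 1
    pr = [1.0 / n] * n
    for _ in range(iters):
        new_pr = [(1.0 - beta) / n] * n        # teleport share
        for i in range(n):
            outd = off[i + 1] - off[i]
            if outd == 0:
                continue                       # dangling node keeps teleport only
            share = beta * pr[i] / outd
            for j in range(off[i], off[i + 1]):
                new_pr[nb[j]] += share
        pr = new_pr
    return pr                                  # rank score per sentence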
4. EXPERIMENTS

In this section, we conduct experiments to evaluate the effectiveness and possible positive contributions of the proposed method, compared with other existing summarization systems, on a few different summarization tasks, including general/query, comparative, and timeline summarization.
4.1. Evaluation Metric

We used metrics based on the ROUGE scores that are widely used in traditional summarization tasks. The Recall-Oriented Understudy for Gisting Evaluation (ROUGE) evaluation package [23] compares summary results from several summarization methods with summaries generated by humans. In timeline evaluation tasks, the quality of different TSs is compared via the F-measure of ROUGE-1 and ROUGE-2. In this paper, we adopted the same metrics, plus the additional ROUGE-S*. Technically, ROUGE-S* is computed the same way as the bigram-based ROUGE-2 scores, but it allows the words in the bigram to be separated by a window. This makes ROUGE-S* better capture the global distributional semantics, while the traditional ROUGE-N scores better capture the local semantics, i.e., sentence-to-sentence matching.
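For illustration, skip-bigram extraction, the basis of ROUGE-S-style scores, can be sketched as follows (a sketch of ours, not the ROUGE package's code):

from itertools import combinations

def skip_bigrams(tokens, window=None):
    # Ordered word pairs; window=None allows any gap (as in ROUGE-S*),
    # while a finite window restricts how far apart the two words may be.
    pairs = set()
    for i, j in combinations(range(len(tokens)), 2):
        if window is None or j - i <= window:
            pairs.add((tokens[i], tokens[j]))
    return pairs

cand = "the cat sat on the mat".split()
ref = "the cat lay on the mat".split()
overlap = skip_bigrams(cand) & skip_bigrams(ref)  # basis of the S* score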
The set of input parameters for FastSum, namely 1) the number of n-grams; 2) the number of bands; 3) the number of rows; and 4) the number of elements, is separately defined for each kind of treated summarization task, as reported in Table 2. The rationale for picking these values is purely empirical and is based on the experiments presented in the rest of the paper.
Table 2. FastSum input parameters

Task          | #ngrams | #bands | #rows | #elements
General/Query | 4       | 20     | 2     | 40
Timeline      | 4       | 20     | 3     | 60
Comparative   | 6       | 20     | 5     | 100
4.2. General summarization

We use the DUC05 and DUC06 data sets to evaluate our proposed method empirically on general and query focused summarization tasks. The benchmark data sets are from DUC (http://duc.nist.gov), for automatic summarization evaluation. The DUC05 and DUC06 data sets consist of 50 topics. The task is to create a summary of no more than 250 words. In those document data sets, stop words were removed using the stop list (ftp://ftp.cs.cornell.edu/pub/smart/english.stop) and the terms were stemmed using Porter's scheme (http://www.tartarus.org/martin/PorterStemmer/), a commonly used algorithm for word stemming in English.
Table 3. Evaluation of the methods on the DUC2005 dataset.

Summarizers | ROUGE-1    | ROUGE-2    | ROUGE-SU4
Avg-Human   | 0.4417 (1) | 0.1023 (1) | 0.1622 (1)
Avg-DUC05   | 0.3434 (7) | 0.0602 (6) | 0.1148 (7)
System-15   | 0.3751 (4) | 0.0725 (4) | 0.1316 (4)
System-4    | 0.3748 (5) | 0.0685 (5) | 0.1277 (5)
Biased-Lex  | 0.3861 (3) | 0.0753 (3) | 0.1363 (3)
wAASum      | 0.3945 (2) | 0.0797 (2) | 0.1420 (2)
FastSum     | 0.3697 (6) | 0.0506 (7) | 0.1168 (6)
We work with the following methods for general/query-focused summarization as the baseline systems to compare with our proposed method: (1) Avg-Human: the average human summarizer on DUC05(06); (2) Avg-DUC05(06): the average system summarizer; (3) System-15(24): the best system summarizer from DUC05(06); (4) System-4(12): the second best system summarizer from DUC05(06); (5) Biased-Lex (Lex-PageRank): by calculating the eigenvector centrality on the sentence-to-sentence similarity graph, the method extracts the most significant sentences; (6) wAASum: the weighted archetypal analysis summarization system over the sentence similarity graph; (7) FastSum: the method presented in this paper. Although there are, for each year, more than 30 systems that participated in the DUC competition, here we only compare with the DUC human best, the DUC human average, the DUC system best and the DUC system average results.
Table 4. Evaluation of the methods on the DUC2006 dataset.

Summarizers | ROUGE-1    | ROUGE-2    | ROUGE-SU4
Avg-Human   | 0.4576 (1) | 0.1149 (1) | 0.1706 (1)
Avg-DUC06   | 0.3795 (7) | 0.0754 (6) | 0.1321 (7)
System-24   | 0.4102 (3) | 0.0951 (2) | 0.1546 (4)
System-12   | 0.4049 (5) | 0.0899 (4) | 0.1476 (5)
Biased-Lex  | 0.3899 (6) | 0.0856 (5) | 0.1394 (6)
wAASum      | 0.4238 (2) | 0.0917 (3) | 0.1671 (2)
FastSum     | 0.4086 (4) | 0.0710 (7) | 0.1616 (3)
Table 5. Results in comparative summarization: sentences selected by our proposed FastSum approach.

ID | Selected sentence
1 | At the Madrid summit last December, leaders of EU member nations agreed unanimously that the European single currency will be formally launched on January 1, 1999.
2 | The Wa National Organization, the Palaung State Liberation Front and the Lahu Democratic Front said the arrests were an insulting act of shameless, barbaric arrogance against the people of Burma.
3 | ETA, which stands for Basque Homeland and Freedom, has killed nearly 800 people in its 30-year campaign for an independent Basque nation carved out of parts of northern Spain and southern France.
4 | Unemployment in France fell to 11.6 percent in October from 11.7 percent, reflecting a slow but steady improvement in France's high jobless rate, long one of the nation's knottiest problems.
5 | Researchers evaluate overweight and obesity using a measure called body mass index (BMI), which divides weight in kilograms by the square of height in meters.
Tables 3 and 4 show the ROUGE scores of the different methods using the DUC05 and DUC06 data sets, respectively. A higher ROUGE score indicates better summarization performance. The number in parentheses in each table slot shows the ranking of each method on a specific data set. Even though our results are not among the best, we show that by sacrificing a rather small percentage of recall and precision in the quality of the produced summary, we can reduce the quadratic time complexity of other typical summarization systems to sub-linear.
4.3. Comparative summarization

In this section we investigate one of the recent summarization tasks, first proposed by [10]. We model comparative summarization as follows: extract the summaries S_1, S_2, ..., S_N from the given N groups of documents G_1, G_2, ..., G_N, such that the extracted summaries are as divergent as possible from one another on the topic level while still expressing the central themes of the corresponding groups. We propose the following function for comparative summarization to generate a discriminant summary for each group of documents:

C_s(i, j) = sim_norm(s_i, s_j)    if G(s_i) = G(s_j)
C_s(i, j) = −sim_norm(s_i, s_j)   if G(s_i) ≠ G(s_j)        (1)

where G(s_i) is the document group containing sentence s_i, and sim_norm(s_i, s_j) is the normalized sentence similarity.
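A sketch of this scoring (the sign convention, positive within a group and negative across groups, is our reading of the discriminative intent of Eq. (1); names are ours):

def comparative_score_matrix(groups, sim):
    # groups[i]: group id of sentence i; sim(i, j): normalized similarity.
    # Within-group similarity counts positively, cross-group negatively.
    n = len(groups)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = sim(i, j)
            C[i][j] = s if groups[i] == groups[j] else -s
    return C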
Evaluation: Since there is currently no dataset/methodology available to carry out a quantitative evaluation of comparative summarization, we used five clusters of documents from the DUC07 corpora to generate comparative summaries using the FastSum method. The data set contains the following five clusters:
1. Steps toward introduction of the Euro;
2. Burma government change 1988;
3. Basque separatist movement 1996-2000;
4. Unemployment in France in the 1990s;
5. Obesity in the United States and possible causes for US obesity.
Looking at the results of the FastSum sentence selection method in Table 5, each of the sentences represents one cluster respectively and summarizes well the specific topics of its cluster. In Table 5, we also highlight some keywords representing the unique features of each topic.
Table 6. Average results on 17 timelines; the reported results are computed at a 95% confidence interval.

Summarizers  | ROUGE-1 | ROUGE-2 | ROUGE-SU4
Random       | 0.128   | 0.021   | 0.026
Chieu et al. | 0.202   | 0.037   | 0.041
Tran et al.  | 0.230   | 0.053   | 0.050
FastSum      | 0.197   | 0.032   | 0.039
Note that the sentences extracted by FastSum for each topic are not just discriminative; they also present the essence of the topic.
4.4. Timeline summarization

In order to evaluate our method on the timeline summarization task we used the Timeline17 dataset [8]. Briefly, the dataset consists of 17 manually created timelines and their associated news articles. The data set is published online at http://www.l3s.de/~gtran/timeline/. We evaluate our system against traditional multi-document summarization and timeline generation systems. The following is the list of those systems. Random: the system generates a day summary for each day by randomly selecting sentences from that particular day. Chieu et al. [9] is a multi-document summarizer which utilizes the popularity of a sentence, measured as TF-IDF similarity with other sentences, to estimate its importance. Tran et al. [8] use SVM-rank, one of the most common learning-to-rank implementations, to demonstrate the performance of their system.

Results: The average results of TS generation on the given dataset are presented in Table 6. Although, compared to the other two systems, ours seems to perform more poorly, this is mainly due to its simplicity, which is the price paid for its scalability.
5. CONCLUSION AND FUTURE WORK

A particular challenge for graph based multi-document summarization methods is a large time complexity of at least O(n^2) for text modeling, plus some additional complexity for sentence selection. Hence we need effective summarization methods to reduce this high time complexity. In this paper we have formalized the problem of fast and scalable document summarization as a combination of (1) the text modeling sub-problem of calculating the similarity graph based on locality sensitive hashing and (2) the sentence selection sub-problem of I/O access efficient node ranking. The contribution of the work can be summarized as:
1. The paper presents a new fast sentence selection algorithm able to scale effortlessly.
2. We describe the method for sub-linear time text modeling by means of a sentence similarity graph and very efficient node ranking in those graphs.
3. No individual part of our method is new or revolutionary. Locality sensitive hashing has been done before, as have node ranking and its usage in summarization. The novelty is in the combination of these individually useful parts into a single, coherent, real-time summarization system. We have not seen LSH nor our node ranking implementation applied to summarization in this way before.
4. We extensively evaluated our method on a few different summarization tasks.
In the future, FastSum may be further improved. There are many potential directions for improvement, such as:
1. improving FastSum into a distributed real-time multi-document summarization system;
2. adapting FastSum to, and testing it as, a system capable of scaling to many servers and huge collections of documents;
3. enhancing the sentence similarity calculation, in order to improve the quality of the produced summaries, by using WordNet and by adapting LSH to fast semantic similarity calculation.
6. SOURCE CODE

All the source code can be downloaded as an SVN checkout at:
https://github.com/ErcanCanhasi/FastDocumentSummarization.git
REFERENCES
[1] Pankaj B, Agrawal AJ. Extractive Based Single Document Text Summarization Using Clustering Approach. IAES International Journal of Artificial Intelligence (IJ-AI) 2014; 3(2).
[2] Pedram VA, Omid SSh. Scientific Documents Clustering Based on Text Summarization. International Journal of Electrical and Computer Engineering (IJECE) 2015; 5(4): 782–787.
[3] Canhasi E, Kononenko I. Multi-document summarization via Archetypal Analysis of the content-graph joint model. Knowledge and Information Systems (KAIS) 2014; 41(3): 821–842.
[4] Canhasi E, Kononenko I. Weighted archetypal analysis of the multi-element graph for query-focused multi-document summarization. Expert Systems with Applications (ESWA) 2014; 41(2): 535–543.
[5] Canhasi E, Kononenko I. Weighted hierarchical archetypal analysis for multi-document summarization. Computer Speech and Language 2015; http://dx.doi.org/10.1016/j.csl.2015.11.004.
[6] Canhasi E, Kononenko I. Automatic Extractive Multi-document Summarization Based on Archetypal Analysis. In: Non-negative Matrix Factorization Techniques. Springer Berlin Heidelberg, 2016; 75–88.
[7] Dipti YS. Effect of feature selection on small and large document summarization. IAES International Journal of Artificial Intelligence (IJ-AI) 2014; 3(3).
[8] Tran GB, Tran AT, Tran NK, Alrifai M, Kanhabua N. Leveraging Learning To Rank in an Optimization Framework for Timeline Summarization. 2013.
[9] Chieu HL, Lee YK. Query based event extraction along a timeline. In: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, ACM, 2004; 425–432.
[10] Wang D, Zhu SL, Gong TY. Comparative document summarization via discriminative sentence selection. TKDD 2012; 6(3): 1–12.
[11] Chatterjee N, Mohan S. Extraction-based single-document summarization using random indexing. In: Tools with Artificial Intelligence, ICTAI 19th IEEE International Conference, 2007; 448–455.
[12] Mihalcea R, Tarau P. TextRank: Bringing Order into Texts. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), ACL, 2004; 404–411.
[13] Erkan G, Radev DR. LexRank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research 2004; 457–479.
[14] Otterbacher J, Erkan G, Radev DR. Biased LexRank: Passage retrieval using random walks with question-based priors. Information Processing and Management 2009; 45(1): 42–54.
[15] Andoni A, Indyk P. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In: Foundations of Computer Science, FOCS'06, 47th Annual IEEE Symposium on, IEEE, 2006; 459–468.
[16] Henzinger M. Finding near-duplicate web pages: a large-scale evaluation of algorithms. In: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, ACM, 2006; 284–291.
[17] Broder AZ. On the resemblance and containment of documents. In: Compression and Complexity of Sequences 1997. Proceedings, IEEE, 1997; 21–29.
[18] Bendersky M, Croft WB. Finding text reuse on the web. In: Proceedings of the Second ACM International Conference on Web Search and Data Mining, ACM, 2009; 262–271.
[19] Bayardo RJ, Ma Y, Srikant R. Scaling up all pairs similarity search. In: Proceedings of the 16th international conference on World Wide Web, ACM, 2007; 131–140.
[20] Vernica R, Carey MJ, Li C. Efficient parallel set-similarity joins using MapReduce. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, ACM, 2010; 495–506.
[21] Li J, Li L, Li T. Multi-document summarization via submodularity. Applied Intelligence 2014; 37(3): 420–430.
[22] Fattah MA. A hybrid machine learning model for multi-document summarization. Applied Intelligence 2014; 40(4): 592–600.
[23] Lin CY. ROUGE: a package for automatic evaluation of summaries. In: Text summarization branches out: proceedings of the ACL-04 workshop, ACL, 2004; 74–81.
BIOGRAPHY OF AUTHOR

Ercan Canhasi received his Ph.D. in 2014 from the University of Ljubljana, Slovenia. He is an assistant professor at the Faculty of Computer Science in Prizren. His research interests include text mining, natural language processing and text summarization. He is the (co)author of a few papers. Further info on his homepage: https://sites.google.com/a/uni-prizren.com/ercancanhasi/