International Journal of Electrical and Computer Engineering (IJECE)
Vol. 6, No. 3, June 2016, pp. 945-954
ISSN: 2088-8708, DOI: 10.11591/ijece.v6i3.9030

Fast Document Summarization using Locality Sensitive Hashing and Memory Access Efficient Node Ranking

Ercan Canhasi*
*Faculty of Computer Science, University of Prizren

Article history: Received Feb 5, 2016; Revised May 7, 2016; Accepted May 19, 2016

Keywords: Document summarization, Locality Sensitive Hashing, I/O Access Efficient Node Ranking, Min-Hashing, Timeline summarization, Comparative summarization

ABSTRACT
Text modeling and sentence selection are the fundamental steps of a typical extractive document summarization algorithm. The common text modeling method connects a pair of sentences based on their similarity. Even though it can effectively represent the sentence similarity graph of the given document(s), its big drawback is a large time complexity of $O(n^2)$, where n represents the number of sentences. The quadratic time complexity makes it impractical for large documents. In this paper we propose fast approximation algorithms for text modeling and sentence selection. Our text modeling algorithm reduces the time complexity to near-linear time by rapidly finding the most similar sentences to form the sentence similarity graph. In doing so we utilize Locality-Sensitive Hashing, a fast algorithm for approximate nearest neighbor search. For the sentence selection step we propose a simple memory-access-efficient node ranking method based on the idea of sequentially scanning only the neighborhood arrays. Experimentally, we show that sacrificing a rather small percentage of recall and precision in the quality of the produced summary can reduce the quadratic time complexity to sub-linear. We see big potential for the proposed method in text summarization for mobile devices and in big text data summarization for the Internet of Things on the cloud. In our experiments, besides evaluating the presented method on the standard general and query-focused multi-document summarization tasks, we also tested it on a few alternative summarization tasks, including timeline and comparative summarization.

Copyright (c) 2016 Institute of Advanced Engineering and Science. All rights reserved.

Corresponding Author:
Ercan Canhasi
Faculty of Computer Science, University of Prizren
Rruga e Shkronjave, nr. 1, 20000 Prizren, Kosove
ercan.canhasi@uni-prizren.com

1. INTRODUCTION
The world is floating on data. These data mainly come from the World Wide Web, which is expanding exponentially and making a massive volume of online information available to users. For years we have been affected by the quantity of data streaming through and produced by our systems. Automatic document summarization, as a complementary tool to regular web search engines, can be used to scale down this problem of information overload. Since most mobile and interactive ubiquitous multimedia devices have restricted hardware such as CPU, memory, and display screen, it is essential to compress a document collection into a brief summary before it is delivered to the users of these devices.
Other technology trends which can largely benefit from scalable document summarization methods are the Internet of Things (IoT) on the cloud and big data, the latter being the processing and analysis of large data repositories on cloud computing. Big-document summarization is an important technique for the data management of the IoT, since traditional document summarization methods are too restricted to summarize the suitable information from the exploding IoT big data on the cloud.

The main task of extraction-based multi-document summarization is to extract the most important sentences from multiple documents and format them into a summary. According to the number of documents to be summarized, the task can be single-document or multi-document summarization. Query-focused multi-document summarization is a special case of multi-document summarization: given a query, the task is to produce a summary which can respond to the information required by the query. Two other specific document summarization tasks treated in this work are timeline and comparative summarization.
Table 1. Observed execution times spent on calculations (in seconds). The time elapsed in computing the summaries was measured on a processor with the following specifications: Intel(R) Core(TM) i5 CPU M 450 @ 2.45GHz with 4GB of RAM. The first two columns, document(s) length in KB and the total number of sentences, represent the input values. The remaining four columns, text modeling by means of LSH and by the conventional all-to-all comparison method, and sentence selection by I/O-access-efficient node ranking and by archetypal analysis (AA), are the measured times, i.e. the output values of the experiment.

Doc.(s) length   # of sentences   Text modeling        Sentence selection
                                  LSH      all-to-all   node ranking   AA
1KB              13               0.008    0.09         0.021          1.73
20KB             160              0.054    1.40         0.056          1.11
45KB             366              0.138    2.28         0.139          4.82
100KB            744              0.303    5.89         0.308          9.21
644KB            5112             4.110    191.06       4.123          207.89

Timeline summarization aims at producing a sequence of compact summaries for news sets broadcast at various periods. Comparative news summarization aims to outline the mutualities and contrasts between comparable news subjects.

In this paper, we propose a scalable solution to multi-document summarization based on randomized algorithms. Many sentence similarity graph generation algorithms make use of some similarity measure (e.g., cosine similarity) to compute the pairwise distance between the sets of vectors representing the corresponding sentences. Assume that we are given n sentences with a maximum of m terms. Calculating the full similarity matrix would take time $O(n^2 m)$. Many novel summarization tasks, such as comparative, update and timeline summarization, require processing a large number of sentences, and having an $O(n^2 m)$ algorithm in such setups would be very impractical. Fortunately, we can borrow some ideas from mathematics and theoretical computer science to develop a scalable document summarization algorithm proportional to $nm$. The essence of our method lies in defining Locality Sensitive Hash (LSH) functions. LSH functions involve the creation of short signatures (fingerprints) for each vector in space, such that vectors that are closer to each other are more likely to have similar fingerprints. LSH functions are generally based on randomized algorithms and are probabilistic.

The contribution of this paper is fourfold: 1) The paper presents a new fast sentence selection algorithm able to scale effortlessly; 2) We describe a method for sub-linear time text modeling by means of a sentence similarity graph, together with very efficient node ranking in those graphs; 3) No individual part of our method is new or revolutionary: locality sensitive hashing has been done before, as have node ranking and its usage in summarization. The novelty is in the combination of these individually useful parts into a single, coherent, real-time summarization system. We have not seen LSH or our node ranking implementation applied to summarization in this way before; 4) We extensively evaluated our method on a few different summarization tasks.

The remainder of the paper is organized as follows. Section 2 briefly presents the related work. In Section 3 we describe the centerpiece of this work, namely the fast document summarization method. Section 4 gives a description of the test environment and the data sets we used for testing,
together with the results. We conclude the paper and set guidelines for further work in Section 5; Section 6 points to the source code.

2. RELATED WORK
Our work is related to various research fields, including general and query-focused summarization [1, 2, 3, 4, 5, 6, 7], timeline summarization [8, 9], comparative summarization [10], sampling-based document summarization algorithms [11], node ranking of the sentence similarity graph [12, 13, 14] and similarity search for high-dimensional data objects [15, 16]. The following paragraphs give a brief survey of these works.

In [11], the authors use random indexing for text modeling. Random indexing presents a computationally efficient way of implicit dimensionality reduction. It involves inexpensive vector computations such as addition, and thus provides an efficient way to compute similarities between words, sentences and documents.

The algorithm that we use in this paper, min-hash [17], was originally developed for returning only the authoritative documents in search results. Another closely related problem is the one known as text reuse [18]; in contrast to near-duplicate detection, the focus there is usually on smaller segments of text as opposed to entire documents. Other similar formulations of the problem are what the data mining community calls pairwise similarity search or all-pairs search [19] and what the database community calls set similarity join [20].
Our previous work [3, 4, 5] presents an extractive summarization framework based on three alternative models that integrate archetypal-analysis-based sentence selection: (1) plain archetypal analysis sentence clustering and ranking for general summarization; (2) weighted archetypal analysis sentence selection for query-focused document summarization; and (3) weighted hierarchical archetypal analysis sentence selection for four different summarization tasks.

Timeline summarization (TS for short) has become a widely adopted, natural way to present long news stories in a compact manner. Existing approaches for TS aim to generate a good daily summary for each of the relevant dates (e.g., [8, 9]). In this study, we focus on showing how the presented method can be used for the TS problem directly, without extra effort.

Comparative multi-document summarization (CDS) was first proposed in [10] to summarize the differences between comparable document groups. [10] presents a sentence selection strategy modeled by means of conditional entropy, which precisely discriminates the documents in different groups.

Graph-based methods like TextRank [12] and LexRank [13] model a document or a set of documents as a text similarity graph, constructed by taking sentences as vertices and the similarities between sentences as edge weights. They take into account the global information and recursively calculate sentence significance from the entire text graph, rather than simply relying on unconnected individual sentences. From an NLP perspective, extractive summarization embodies two criteria: sentence relevance and sentence redundancy. Graph-based sentence ranking algorithms successfully merge both of these criteria into a single framework by utilizing the so-called graph-based lexical centrality principle. Graph-based ranking algorithms were also used in query-focused summarization when it became a popular research topic. For instance, a topic-sensitive version of LexRank was proposed in [14]; it integrates the relevance of a sentence to the query into LexRank to get a biased PageRank ranking.

A work similar to ours [21] presents a new principled and versatile summarization framework (MDS) using submodular functions. This framework can deal with different summarization tasks, including generic, query-focused, update and comparative summarization. The empirical results show that this framework outperforms its rivals in generic summarization and is competitive in the other summarization tasks. In [22] the authors investigated the use of maximum entropy, naive Bayes and support vector machine models, as well as a hybrid machine model, for multi-document automatic text summarization.

3. FAST DOCUMENT SUMMARIZATION
3.1. Motivation
The trend in automatic document summarization approaches found in state-of-the-art systems is a general summarization method consisting of the following sub-tasks:
1. Text modeling: convert the text into, for instance, a graph representation;
2. Sentence ranking: identify the salient sentences in the given text model;
3. Summary generation: extract the selected sentences into the final summary.

To obtain the sentence similarity graph, one needs to compute the similarity values for all possible pairs of sentences in order to connect them in the graph. Mainly the vector space model is used to represent the sentences of the given documents.
The vector space model is an algebraic model for representing sentences as vectors of terms. The cosine similarity is a measure of similarity between two vectors of an inner product space that measures the cosine of the angle between them. Assuming that multiplication and addition are constant-time operations, the time complexity of computing the cosine similarity is $O(m) + O(m) = O(m)$, where m is the largest number of terms. But since we need to compute the sentence similarity for every pair of sentences, the time and space complexity of generating the sentence similarity graph becomes $O(n(n-1)/2)$, where n is the number of sentences. To give a better gist of the time complexity, we report the elapsed times for producing the summary for different document(s) lengths in Table 1. Not only is the similarity graph calculation time-expensive, but sentence selection methods are usually also very time consuming. Hence, this paper presents a way of using fast randomized approximation algorithms (i.e., LSH and min-hash) to deal with the quadratic complexity of the conventional text modeling techniques. Our approximation algorithm utilizes Locality-Sensitive Hashing, abbreviated as LSH hereafter, which is a probabilistic approximation algorithm for the nearest neighbor search. We not only try to increase the speed and scalability of the summary production system on the text modeling level, but also present our contribution on the sentence selection/extraction level.

For text modeling we propose the following essential steps in similarity graph computation (a sketch of the naive quadratic baseline these steps replace is given after this list):
1. N-gram extraction: convert sentences into sets;
2. Min-hashing: convert wide sets to short signatures, while preserving similarity;
3. Locality-sensitive hashing: concentrate on couples of signatures probable to originate from similar sentences.
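For contrast, here is a minimal sketch (not the paper's implementation) of the conventional all-to-all text modeling that the three steps above replace: each sentence becomes a term-frequency vector, and all $O(n(n-1)/2)$ cosine similarities are computed. The whitespace tokenization and the 0.1 edge threshold are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    # O(m) dot product and norms, m being the number of distinct terms.
    dot = sum(v * b[t] for t, v in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def naive_similarity_graph(sentences, threshold=0.1):
    # Vector space model: one term-frequency vector per sentence.
    vectors = [Counter(s.lower().split()) for s in sentences]
    edges = []
    # All-to-all comparison: O(n^2 m) overall -- the bottleneck LSH avoids.
    for i, j in combinations(range(len(sentences)), 2):
        sim = cosine(vectors[i], vectors[j])
        if sim >= threshold:
            edges.append((i, j, sim))
    return edges
```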
[Figure 1. Method overview: the documents are converted into n-gram sets; min-hashing turns these into short integer signature vectors; locality-sensitive hashing groups the signatures into candidate pairs that form the LSH-based sentence similarity graph, stored as adjacency lists (Off[]: offsets for node i; Nb[]: pointers to out-neighbors); I/O-access-efficient node ranking then yields the node-ranked list of sentences used for sentence selection and ordering into the summary.]

Given the sets of similar sentences, we can very efficiently compute the sentence ranking by mapping the problem of sentence selection to node ranking in the graph of similar sentence sets. For sentence selection we propose the following essential steps in I/O-efficient node ranking:
1. Get the input sentence similarity graph represented as a set of three arrays;
2. Produce the sentence ranking by recursively computing the eigenvector decomposition;
3. Return the sentence ranking.
In the rest of the section we describe these steps in more detail.

3.2. Fast similarity graph computing
In this subsection, the problem of sentence similarity is first described as a search for sets with an approximately big intersection. We then show how the problem of finding textually similar sentences can be turned into such a set problem by representing the given text entities as sets of n-grams. Then, we present how the method known as min-hashing can be used for shortening these huge n-gram sets while preserving the similarity information of the underlying sets. Finally, we utilize locality-sensitive hashing to focus our search on the couples of sentences that are most probable to be similar.

Let us assume that the similarity of a pair of sentences can be deduced by merely looking at the relative size of their intersection. This is the similarity measure known as the Jaccard similarity. If the Jaccard similarity of sets W and Z is $|W \cap Z| / |W \cup Z|$, then the Jaccard similarity of sentences $S_1$ and $S_2$ can be denoted as $SIM(S_1, S_2)$. The form of similarity we are utilizing here is character-level similarity.

A very simple but productive method for representing sentences as sets is to describe them as the sets of very short strings that occur within them. In this way, sentences that share pieces as short as words or even syllables will have many common elements in their sets, even if those common entities appear in different orders in the two sentences. Define an n-gram of a sentence to be any substring of length n found within the sentence. Then, associate each sentence with the set of n-grams that appear one or more times within it. Instead of manipulating the sub-strings themselves, we choose a hash function that maps them to some number of buckets and treat the resulting bucket number as the n-gram. The set defining a sentence eventually becomes a set of integers, namely the bucket numbers of the n-grams that appear in the sentence. In this way we drastically compress the original textual data. A minimal sketch of this shingling step follows.
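The sketch below illustrates character-level n-gram extraction and the Jaccard measure. Python's built-in hash and a bucket count of $2^{32}$ are illustrative assumptions (the paper fixes neither); n = 4 matches the General/Query setting later reported in Table 2.

```python
def ngram_set(sentence: str, n: int = 4, buckets: int = 2**32) -> set[int]:
    # Every length-n character substring, hashed to a bucket number.
    return {hash(sentence[i:i + n]) % buckets for i in range(len(sentence) - n + 1)}

def jaccard(w: set[int], z: set[int]) -> float:
    # SIM(S1, S2) = |W ∩ Z| / |W ∪ Z|
    return len(w & z) / len(w | z) if (w or z) else 0.0

s1 = "the cat sat on the mat"
s2 = "the cat sat on a mat"
print(jaccard(ngram_set(s1), ngram_set(s2)))  # high: the sentences share most of their 4-grams
```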
Algorithm 1 Fast min-hashing algorithm
1: procedure MINHASHING(S[], H[], n, k)
2:   Input: S (characteristic vector of the n-gram set), H (k random hash functions), n (number of n-grams), k (number of hash functions);
3:   Output: c, the min-hash signature of the input set of n-grams S;
4:   c[] ← new int[k]
5:   for j = 0 to k - 1 do
6:     c[j] ← ∞
7:   for i = 0 to n - 1 do
8:     if S(i) == 1 then
9:       for j = 0 to k - 1 do
10:        if h_j(i) < c[j] then
11:          c[j] ← h_j(i)
12: end procedure

Algorithm 2 Approximate nearest neighbor search
1: procedure LSH(M[,], s, b, r)
2:   Input: M (min-hash signature matrix), s (similarity threshold), b (number of bands), r (number of rows per band);
3:   Output: F, the set of sentence pairs with Jaccard similarity at least s;
4:   divide matrix M into b bands of r rows
5:   for each band do
6:     hash the band's portion of each column to a hash table with k buckets; make k as large as possible
7:   end for
8:   candidate column pairs are those that hash to the same bucket in at least one band
9:   tune b and r to catch most truly similar sentence pairs
10: end procedure

Since the sets of n-grams are usually large, one can replace them with much smaller representations called signatures. Signatures can be calculated using the method known as min-hashing, briefly given in Algorithm 1. This technique guarantees that two similar objects generate hashes that are themselves similar. In fact, the similarity of the hashes has a direct relationship to the similarity of the sentences they were generated from: the fraction of positions on which two signatures agree approximates the Jaccard similarity.

Although min-hashing compresses large sets into small signatures while preserving the expected similarity of any pair of sentences, there is still another very important issue: efficiently finding the pairs of sentences with the greatest similarity can be very time consuming, because the number of pairs of sentences may be too large. The brute-force approach would be to compare each sentence with every other sentence using MinHash, which obviously has quadratic time complexity. A faster solution is to use Locality Sensitive Hashing (LSH), which takes the MinHash values of the sentences and hashes them so that similar sentences land in the same bucket. The brief algorithm is described in Algorithm 2. Note that the computation time for LSH with MinHash depends only on the number of sentences and the number of MinHash functions used, and not on the length of the sentences.

We can now give an approach to finding the set of candidate pairs of similar sentences and then discovering the truly similar sentences among them (a compact sketch of the min-hashing and banding steps follows this list):
1. Pick a value of n and construct from each sentence its set of n-grams.
2. Hash the n-grams to shorter bucket numbers.
3. Sort the sentence/n-gram pairs to order them by the latter.
4. Pick a length k for the min-hash signatures. Feed the sorted list to Algorithm 1 to compute the min-hash signatures of all the sentences.
5. Choose a threshold t that defines how similar sentences have to be in order for them to be regarded as a desired "similar pair". Pick a number of bands b and a number of rows r such that br = k and the threshold t is approximately $(1/b)^{1/r}$.
6. Construct candidate pairs by applying the LSH technique described in Algorithm 2.
7. Connect the sentences in the similarity graph based on the candidate pairs calculated by LSH.
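The following is a minimal runnable sketch of Algorithms 1 and 2 combined. The universal hash functions of the form $(ax + b) \bmod p$, the use of Python dictionaries as the per-band hash tables, and the b = 20, r = 2, k = 40 setting (the General/Query row of Table 2) are assumptions for illustration.

```python
import random
from collections import defaultdict
from itertools import combinations

P = (1 << 61) - 1  # a large prime for universal hashing

def make_hashes(k: int, seed: int = 42):
    rng = random.Random(seed)
    return [(rng.randrange(1, P), rng.randrange(P)) for _ in range(k)]

def minhash_signature(ngrams: set[int], hashes) -> tuple:
    # Algorithm 1: for each hash function keep the minimum value over the set.
    return tuple(min((a * x + b) % P for x in ngrams) for a, b in hashes)

def lsh_candidate_pairs(signatures, b: int = 20, r: int = 2):
    # Algorithm 2: split each signature into b bands of r rows; two sentences
    # become a candidate pair if any band hashes them to the same bucket.
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for sent_id, sig in enumerate(signatures):
            buckets[sig[band * r:(band + 1) * r]].append(sent_id)
        for ids in buckets.values():
            candidates.update(combinations(ids, 2))
    return candidates

# Usage, given the ngram_set() function sketched earlier:
# hashes = make_hashes(k=40)                       # k = b * r
# sigs = [minhash_signature(g, hashes) for g in ngram_sets]
# pairs = lsh_candidate_pairs(sigs, b=20, r=2)     # threshold ~ (1/b)^(1/r)
```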
3.3. Sentence selection
In this subsection we describe an I/O-efficient graph-based ranking method for selecting sentences from the graph of sentences whose construction methodology was presented in the previous subsection. The idea has been vastly used in document summarization since the pioneering works known as LexRank [13] and TextRank [12]. To efficiently compute the PageRank scores of a big graph, the input sentence similarity graph has to be represented as a bare link structure, more specifically as a set of three arrays: SenL (the list of the n sentences), Off (an array of integers holding, for node i, the offset of its adjacency list) and Nb (an array of integers holding the pointers to the out-neighbors). Using this structure, a simple I/O-efficient PageRank procedure can be written as Algorithm 3. Note that except for the new_pr[] array, which accumulates the PageRank values, all arrays are scanned only once, sequentially from front to end.

Given the node ranking, our summarization approach extracts the most important nodes, i.e. sentences, to include in the summary. Here an additional sentence penalization step is applied. Suppose $x_i$ is the highest ranked sentence. Sentence $x_i$ is moved to the set of sentences representing the final summary, and then a redundancy penalty is imposed on the overall rank score of each sentence linked with $x_i$: for each such sentence $x_j$, its rank score is updated as

$RScore(x_j) = RScore(x_j) \cdot (1 - Sim_{ji})^t,$

where $t > 0$ is the exponent decay factor. The larger t is, the greater the penalty imposed on the overall rank score; if t = 0, no diversity penalty is imposed at all. In our experiments we set t = 3. The presented penalization algorithm is based on the idea that the overall rank score of less informative sentences overlapping with the sentences already in the summary is iteratively decreased. Here, redundancy removal is also the key step of content selection. Finally, the sentence with the highest rank score is repeatedly chosen for the summary until the summary length limit is satisfied.

Algorithm 3 I/O efficient node ranking
1: procedure SENTENCERANKING(SenL, Off, Nb)
2:   Input: SenL (the list of the n sentences), Off (array of integers denoting the offset of the adjacency list of node i), Nb (array of integers denoting the pointers to out-neighbors);
3:   Output: node-ranked list of sentences;
4:   n ← SenL.Count
5:   β ← 0.15
6:   m ← 10
7:   pr[] ← new float[n]
8:   new_pr[] ← new float[n]
9:   for i = 0 to n - 1 do
10:    pr[i] ← 1/n
11:    new_pr[i] ← (1 - β)/n
12:  for k = 0 to m - 1 do
13:    for i = 0 to Off.Count - 2 do
14:      outd ← Off[i + 1] - Off[i]
15:      for j = Off[i] to Off[i + 1] - 1 do
16:        new_pr[Nb[j]] += β · (pr[i]/outd)
17:    pr ← new_pr; reset new_pr[i] ← (1 - β)/n for all i
18: end procedure
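A minimal runnable sketch of Algorithm 3, keeping the flat offset/neighbor array layout so that pr, Off and Nb are scanned sequentially. The damping constant follows the pseudocode above, and the copy/reset of the score arrays between iterations is an assumption the pseudocode leaves implicit.

```python
def sentence_ranking(n: int, off: list[int], nb: list[int],
                     beta: float = 0.15, m: int = 10) -> list[float]:
    # off[i]..off[i+1] delimits the out-neighbor ids of node i inside nb.
    pr = [1.0 / n] * n
    for _ in range(m):
        new_pr = [(1.0 - beta) / n] * n  # teleport term
        for i in range(len(off) - 1):
            outd = off[i + 1] - off[i]
            if outd == 0:
                continue  # dangling node: nothing to distribute
            share = beta * pr[i] / outd
            for j in range(off[i], off[i + 1]):
                new_pr[nb[j]] += share  # sequential scans of off and nb only
        pr = new_pr
    return pr

# Usage on a tiny 3-sentence graph: 0 -> {1, 2}, 1 -> {0}, 2 -> {0}
off = [0, 2, 3, 4]
nb = [1, 2, 0, 0]
print(sentence_ranking(3, off, nb))  # node 0 receives the highest score
```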
4. EXPERIMENTS
In this section, we conduct experiments to evaluate the effectiveness and possible positive contributions of the proposed method, compared with other existing summarization systems, on a few different summarization tasks: general/query-focused, comparative, and timeline summarization.

4.1. Evaluation Metric
We used metrics based on the ROUGE scores that are widely used in traditional summarization tasks. The Recall-Oriented Understudy for Gisting Evaluation (ROUGE) evaluation package [23] compares summary results from several summarization methods with summaries generated by humans. In timeline evaluation tasks, the quality of different TSs is compared via the F-measure of ROUGE-1 and ROUGE-2. In this paper, we adopted the same metrics, plus the additional ROUGE-S*. Technically, ROUGE-S* is computed in the same way as the bigram-based ROUGE-2 score, but it allows the words of a bigram to be separated by a window. This makes ROUGE-S* capture the global distributional semantics better, while the traditional ROUGE-N scores better capture the local semantics, i.e. sentence-to-sentence matching.

The input parameters of FastSum, namely 1) the number of n-grams, 2) the number of bands, 3) the number of rows, and 4) the number of elements, are defined separately for each treated summarization task, as reported in Table 2. The rationale for picking these values is purely empirical and based on the experiments presented in the rest of the paper.

Table 2. FastSum input parameters
Task            #ngrams   #bands   #rows   #elements
General/Query   4         20       2       40
Timeline        4         20       3       60
Comparative     6         20       5       100

4.2. General summarization
We use the DUC05 and DUC06 data sets to evaluate our proposed method empirically on the general and query-focused summarization tasks. The benchmark data sets come from DUC¹, for automatic summarization evaluation. The DUC05 and DUC06 data sets each consist of 50 topics. The task is to create a summary of no more than 250 words. In these document data sets, stop words were removed using a stop list² and the terms were stemmed using Porter's scheme³, a commonly used algorithm for word stemming in English.

Table 3. Evaluation of the methods on the DUC2005 dataset.
Summarizers   ROUGE-1      ROUGE-2      ROUGE-SU4
Avg-Human     0.4417 (1)   0.1023 (1)   0.1622 (1)
Avg-DUC05     0.3434 (7)   0.0602 (6)   0.1148 (7)
System-15     0.3751 (4)   0.0725 (4)   0.1316 (4)
System-4      0.3748 (5)   0.0685 (5)   0.1277 (5)
Biased-Lex    0.3861 (3)   0.0753 (3)   0.1363 (3)
wAASum        0.3945 (2)   0.0797 (2)   0.1420 (2)
FastSum       0.3697 (6)   0.0506 (7)   0.1168 (6)

We use the following methods for general/query-focused summarization as the baseline systems against which to compare our proposed method: (1) Avg-Human: the average human summarizer on DUC05(06); (2) Avg-DUC05(06): the average system summarizer; (3) System-15(24): the best system summarizer from DUC05(06); (4) System-4(12): the second best system summarizer from DUC05(06); (5) Biased-Lex: the Lex-PageRank method, which extracts the most significant sentences by calculating the eigenvector centrality on the sentence-to-sentence similarity graph; (6) wAASum: the weighted archetypal analysis summarization system over the sentence similarity graph; (7) FastSum: the method presented in this paper. Although, for each year, more than 30 systems participated in the DUC competition, here we only compare with the DUC human best, the DUC human average, the DUC system best and the DUC system average results.

¹http://duc.nist.gov
²ftp://ftp.cs.cornell.edu/pub/smart/english.stop
³http://www.tartarus.org/martin/PorterStemmer/
Table 4. Evaluation of the methods on the DUC2006 dataset.
Summarizers   ROUGE-1      ROUGE-2      ROUGE-SU4
Avg-Human     0.4576 (1)   0.1149 (1)   0.1706 (1)
Avg-DUC06     0.3795 (7)   0.0754 (6)   0.1321 (7)
System-24     0.4102 (3)   0.0951 (2)   0.1546 (4)
System-12     0.4049 (5)   0.0899 (4)   0.1476 (5)
Biased-Lex    0.3899 (6)   0.0856 (5)   0.1394 (6)
wAASum        0.4238 (2)   0.0917 (3)   0.1671 (2)
FastSum       0.4086 (4)   0.0710 (7)   0.1616 (3)

Table 5. Results in comparative summarization: sentences selected by our proposed FastSum approach.
ID   Selected sentence
1    At the Madrid summit last December, leaders of EU member nations agreed unanimously that the European single currency will be formally launched on January 1, 1999.
2    The Wa National Organization, the Palaung State Liberation Front and the Lahu Democratic Front said the arrests were an insulting act of shameless, barbaric arrogance against the people of Burma.
3    ETA, which stands for Basque Homeland and Freedom, has killed nearly 800 people in its 30-year campaign for an independent Basque nation carved out of parts of northern Spain and southern France.
4    Unemployment in France fell to 11.6 percent in October from 11.7 percent, reflecting a slow but steady improvement in France's high jobless rate, long one of the nation's knottiest problems.
5    Researchers evaluate overweight and obesity using a measure called body mass index (BMI), which divides weight in kilograms by the square of height in meters.

Tables 3 and 4 show the ROUGE scores of the different methods on the DUC05 and DUC06 data sets, respectively. A higher ROUGE score indicates better summarization performance. The number in parentheses in each table slot shows the rank of the method on the specific data set. Even though our results are not among the best, they show that sacrificing a rather small percentage of recall and precision in the quality of the produced summary can reduce the quadratic time complexity of other typical summarization systems to sub-linear.

4.3. Comparative summarization
In this section we investigate one of the recent summarization tasks, first proposed by [10]. We model comparative summarization as follows: extract the summaries $S_1, S_2, \ldots, S_N$ from the given N groups of documents $G_1, G_2, \ldots, G_N$; the extracted summaries should be as divergent as possible from one another on the topic level while still expressing the central themes of their corresponding groups. We propose the following scoring function for comparative summarization to generate a discriminant summary for each group of documents (a sketch of its use is given after the cluster list below):

$$C_s(i,j) = \begin{cases} \phantom{-}sim_{norm}(s_i, s_j) & \text{if } G(s_i) = G(s_j) \\ -sim_{norm}(s_i, s_j) & \text{if } G(s_i) \neq G(s_j) \end{cases} \qquad (1)$$

where $G(s_i)$ is the document group containing sentence $s_i$ and $sim_{norm}(s_i, s_j)$ is the normalized sentence similarity.

Evaluation: Since there is currently no dataset or methodology available to carry out a quantitative evaluation of comparative summarization, we used five clusters of documents from the DUC07 corpora to generate comparative summaries with the FastSum method. The data set contains the following five clusters:
1. Steps toward the introduction of the Euro;
2. The Burma government change of 1988;
3. The Basque separatist movement, 1996-2000;
4. Unemployment in France in the 1990s;
5. Obesity in the United States and possible causes for US obesity.
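Below is a minimal sketch of how the group-discriminative score of Eq. (1) could drive sentence selection. The aggregation of pairwise scores into one score per sentence is an illustrative assumption, not the paper's exact procedure.

```python
def comparative_scores(n: int, group: list[int], sim) -> list[float]:
    # group[i] is G(s_i); sim(i, j) returns sim_norm(s_i, s_j) in [0, 1].
    scores = [0.0] * n
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Eq. (1): reward within-group similarity, penalize cross-group similarity.
            scores[i] += sim(i, j) if group[i] == group[j] else -sim(i, j)
    return scores  # per group, the top-scoring sentence is the most discriminative
```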
Looking at the results of the FastSum sentence selection method in Table 5, each of the selected sentences represents one cluster and summarizes the specific topics of its cluster well. In Table 5, we also highlight some keywords
representing the unique features of each topic. Note that the sentences extracted by FastSum for each topic are not just discriminative; they also present the essence of the topic.

4.4. Timeline summarization
In order to evaluate our method on the timeline summarization task we used the Timeline17 dataset [8]. Briefly, the dataset consists of 17 manually created timelines and their associated news articles; it is published online⁴. We evaluate our system against traditional multi-document summarization and timeline generation systems, namely: Random, which generates a summary for each day by randomly selecting sentences for that day; Chieu et al. [9], a multi-document summarizer which estimates the importance of a sentence by its popularity, i.e. its TFIDF similarity with the other sentences; and Tran et al. [8], who use SVM-rank, one of the most common learning-to-rank implementations, to rank the sentences.

Result: The average results of TS generation on the given dataset are presented in Table 6. Although, compared to the other two systems, ours seems to perform more poorly, this is mainly due to its simplicity, which is the price paid for its scalability.

Table 6. Average results on 17 timelines; the reported results are computed with a 95% confidence interval.
Summarizers    ROUGE-1   ROUGE-2   ROUGE-SU4
Random         0.128     0.021     0.026
Chieu et al.   0.202     0.037     0.041
Tran et al.    0.230     0.053     0.050
FastSum        0.197     0.032     0.039

⁴http://www.l3s.de/~gtran/timeline/

5. CONCLUSION AND FUTURE WORK
A particular challenge for graph-based multi-document summarization methods is the large time complexity of at least $O(n^2)$ for text modeling, plus the additional complexity of sentence selection. Hence we need effective summarization methods that reduce this high time complexity. In this paper we have formalized the problem of fast and scalable document summarization as the combination of (1) the text modeling sub-problem of calculating the similarity graph based on locality sensitive hashing, and (2) the sentence selection sub-problem of I/O-access-efficient node ranking. The contribution of the work can be summarized as follows:
1. The paper presents a new fast sentence selection algorithm able to scale effortlessly.
2. We describe a method for sub-linear time text modeling by means of a sentence similarity graph, together with very efficient node ranking in those graphs.
3. No individual part of our method is new or revolutionary: locality sensitive hashing has been done before, as have node ranking and its usage in summarization. The novelty is in the combination of these individually useful parts into a single, coherent, real-time summarization system. We have not seen LSH or our node ranking implementation applied to summarization in this way before.
4. We extensively evaluated our method on a few different summarization tasks.

In the future, FastSum may be further improved. There are many potential directions for improvement, such as:
1. turning FastSum into a distributed real-time multi-document summarization system;
2. adapting FastSum to, and testing it as, a system capable of scaling to many servers and huge collections of documents;
3. enhancing the sentence similarity calculation, in order to improve the quality of the produced summaries, by using WordNet and by adapting LSH to fast semantic similarity calculation.
6. SOURCE CODE
All the source code can be downloaded as an SVN checkout from: https://github.com/ErcanCanhasi/FastDocumentSummarization.git

REFERENCES
[1] Pankaj B, Agrawal AJ. Extractive based single document text summarization using clustering approach. IAES International Journal of Artificial Intelligence (IJ-AI) 2014; 3(2).
[2] Pedram VA, Omid SSh. Scientific documents clustering based on text summarization. International Journal of Electrical and Computer Engineering (IJECE) 2015; 5(4): 782-787.
[3] Canhasi E, Kononenko I. Multi-document summarization via archetypal analysis of the content-graph joint model. Knowledge and Information Systems (KAIS) 2014; 41(3): 821-842.
[4] Canhasi E, Kononenko I. Weighted archetypal analysis of the multi-element graph for query-focused multi-document summarization. Expert Systems with Applications (ESWA) 2014; 41(2): 535-543.
[5] Canhasi E, Kononenko I. Weighted hierarchical archetypal analysis for multi-document summarization. Computer Speech and Language 2015; http://dx.doi.org/10.1016/j.csl.2015.11.004
[6] Canhasi E, Kononenko I. Automatic extractive multi-document summarization based on archetypal analysis. In: Non-negative Matrix Factorization Techniques. Springer Berlin Heidelberg, 2016; 75-88.
[7] Dipti YS. Effect of feature selection on small and large document summarization. IAES International Journal of Artificial Intelligence (IJ-AI) 2014; 3(3).
[8] Tran GB, Tran AT, Tran NK, Alrifai M, Kanhabua N. Leveraging learning to rank in an optimization framework for timeline summarization. 2013.
[9] Chieu HL, Lee YK. Query based event extraction along a timeline. In: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM 2004; 425-432.
[10] Wang D, Zhu S, Li T, Gong Y. Comparative document summarization via discriminative sentence selection. TKDD 2012; 6(3): 1-12.
[11] Chatterjee N, Mohan S. Extraction-based single-document summarization using random indexing. In: Tools with Artificial Intelligence (ICTAI), 19th IEEE International Conference 2007; 448-455.
[12] Mihalcea R, Tarau P. TextRank: bringing order into texts. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), ACL 2004; 404-411.
[13] Erkan G, Radev DR. LexRank: graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research 2004; 457-479.
[14] Otterbacher J, Erkan G, Radev DR. Biased LexRank: passage retrieval using random walks with question-based priors. Information Processing and Management 2009; 45(1): 42-54.
[15] Andoni A, Indyk P. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In: Foundations of Computer Science (FOCS'06), 47th Annual IEEE Symposium on, IEEE 2006; 459-468.
[16] Henzinger M. Finding near-duplicate web pages: a large-scale evaluation of algorithms. In: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM 2006; 284-291.
[17] Broder AZ. On the resemblance and containment of documents. In: Compression and Complexity of Sequences 1997, Proceedings, IEEE 1997; 21-29.
[18] Bendersky M, Croft WB. Finding text reuse on the web. In: Proceedings of the Second ACM International Conference on Web Search and Data Mining, ACM 2009; 262-271.
[19] Bayardo RJ, Ma Y, Srikant R. Scaling up all pairs similarity search. In: Proceedings of the 16th International Conference on World Wide Web, ACM 2007; 131-140.
[20] Vernica R, Carey MJ, Li C.
Efficient parallel set-similarity joins using MapReduce. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, ACM 2010; 495-506.
[21] Li J, Li L, Li T. Multi-document summarization via submodularity. Applied Intelligence 2014; 37(3): 420-430.
[22] Fattah MA. A hybrid machine learning model for multi-document summarization. Applied Intelligence 2014; 40(4): 592-600.
[23] Lin CY. ROUGE: a package for automatic evaluation of summaries. In: Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, ACL 2004; 74-81.

BIOGRAPHY OF AUTHOR
Ercan Canhasi received his Ph.D. in 2014 from the University of Ljubljana, Slovenia. He is an assistant professor at the Faculty of Computer Science in Prizren. His research interests include text mining, natural language processing and text summarization. He is the (co)author of a few papers. Further information is available on his homepage: https://sites.google.com/a/uni-prizren.com/ercancanhasi/