Exponential growth in the textual information available online has spurred the need for automatic processing of text for various tasks that humanity performs on computing devices. Automatic document summarization is one such task, with a compelling ability to combat the problem of information overload. Though proposed more than six decades ago by Luhn (1958), progress in the area of automatic summarization has been modest, as evident from the moderate scores achieved by sophisticated deep neural methods (Fang et al., 2017; Nallapati et al., 2017; Dong et al., 2018; Zhou et al., 2018; Narayan et al., 2018; Yasunaga et al., 2019; Alguliyev et al., 2019; Dou et al., 2021; Zhong et al., 2020a; Khurana and Bhatnagar, 2020). Further, the recent debate and consequent surge in the study of evaluation metrics for automatic summaries is a clear and strong testimony to the considerable complexity of the task (Peyrard, 2019a, b; Ermakova et al., 2019; Bhandari et al., 2020b; Vasilyev and Bohannon, 2020; Fabbri et al., 2020; Huang et al., 2020; Bhandari et al., 2020a).
Research in designing and advancing automatic document summarization methods is largely driven by the objective to perform well on a few standard data-sets (Peyrard, 2019a). The author further argues that the community has over-exerted itself in crafting algorithms that improve evaluation scores on the benchmark data-sets, thereby limiting progress in the science of automatic extractive document summarization. Similar searching questions have been asked by Huang et al. (2020), who design a multi-dimensional quality metric and quantify major sources of errors in well-known summarization models. Based on the designed metric, the authors underscore the faithfulness and factual consistency of extractive summaries compared to their abstractive counterparts.
Existing extractive summarization methods rely on the intuitive notions of non-redundancy, relevance, and informativeness as signals of appositeness for the inclusion of sentences in the summary. While non-redundancy and relevance have been modeled in several earlier works (Luo et al., 2010; Alguliev et al., 2011; Gupta et al., 2014; Parveen et al., 2015; Nallapati et al., 2016, 2017; Huang, 2017; Alguliyev et al., 2019; Saini et al., 2019), the notion of informativeness of a sentence is largely unattended. Despite sophisticated supervised and unsupervised techniques, existing approaches for document summarization are fundamentally devoid of any theory of importance. Peyrard (2019a) contends that the notion of importance unifies non-redundancy, relevance, and informativeness, the summary attributes hitherto addressed by the research community in an ad hoc manner.
Emphasis on modeling the vague human intuition of importance, in the complete absence of a theoretical model, is the primary impediment in advancing automatic document summarization from science to technology. Peyrard (2019a) proposes to mitigate the problem by gleaning the probability distribution of semantic units of the document and computing entropy, to encode non-redundancy, relevance, and informativeness of a semantic unit as a single attribute, i.e., importance. Admitting sentences and topics as crucial semantic units for document summarization, we present an in-depth analysis of sentence and topic entropy in the latent semantic space revealed by Non-negative Matrix Factorization (NMF). Based on entropy, we propose an effective sentence scoring function for generating document summaries. Our research contributions are listed below.
We delve into the latent semantic space of the document exposed by NMF (Sec. 3). We compute probability distribution and entropy of semantic units and present corresponding interpretations (Sec. 4.1, 4.2). We deliberate over the complex interplay of topic and sentence entropy by studying its impact on summary quality and corroborate our observations with pertinent empirical analysis (Sec. 4.4).
We propose E-Summ, an unsupervised, generic, explainable, and language agnostic algorithm for extractive document summarization (Sec. 5).
We also evaluate algorithmic summary quality by computing its semantic similarity with the complete document and show that the reference summaries have relatively less semantic overlap compared to E-Summ summaries (Sec. 7.4).
We observe that despite a sound theoretical foundation, E-Summ is not able to match the ROUGE score of deep neural methods. We discuss this observation in detail in Sec. 8.
2 Background and Related Work
In this section, we first describe the proposal forwarded by Peyrard (2019a), which inspires the current work. Next, we present a review of works that use entropy for document summarization. A subsection on Non-negative Matrix Factorization (NMF), the method used to divulge the latent semantic space of the document, follows.
2.1 Information theoretic approach for Document Summarization
Peyrard assumes that a document $D$ can be represented as a probability distribution $P_D$ over semantic units $\omega_i$, where $P_D(\omega_i)$ could be interpreted either as the frequency of unit $\omega_i$ or as its normalized contribution to the meaning of $D$. Terms, topics, frames, embeddings, etc., are possible semantic units comprising the document.
Relevance of the semantic units comprising the summary is critical for reducing the uncertainty about $D$. Low redundancy of semantic units in the summary implies high coverage. High coverage in turn translates to high informativeness, which is modeled by entropy in a straightforward manner. Thus high entropy semantic units, when included in the summary, automatically reduce redundancy and augment informativeness.
Peyrard (2019a) claims that once the background knowledge ($K$) of the reader is modeled, the information theoretic framework is powerful enough to create personalized summaries. Importance of the semantic units to be included in the summary blends informativeness and relevance. The author characterizes the importance of semantic unit $\omega_i$ using a function $f(d_i, k_i)$, where $d_i$, $k_i$ are the respective probabilities of $\omega_i$ in document $D$ and background knowledge $K$. Here, $f$ encodes the importance of the semantic unit and is required to satisfy the following conditions.
Informativeness: $\forall i \neq j$, if $d_i = d_j$ and $k_i < k_j$ then $f(d_i, k_i) > f(d_j, k_j)$. This prefers inclusion of the unit $\omega_i$ in the summary, which is more informative for the user.
Relevance: $\forall i \neq j$, if $d_i > d_j$ and $k_i = k_j$ then $f(d_i, k_i) > f(d_j, k_j)$. This condition implies that of two semantic units $\omega_i$ and $\omega_j$ occurring with equal probability in the background knowledge ($K$) of the user, the one that is more frequent in $D$ is relevant for the summary.
Additivity: This condition combines informativeness and relevance using Shannon's theory of information, and is defined as $I(f(d_i, k_i)) \equiv \alpha \cdot I(d_i) + \beta \cdot I(1/k_i)$, where $I(\cdot)$ denotes self-information and $\alpha$, $\beta$ are user-defined parameters representing the strength of relevance and informativeness respectively.
Normalization: $\sum_i f(d_i, k_i) = 1$, which ensures that $f$ defines a valid probability distribution.
The author establishes that a function satisfying the above four requirements has the fixed form $f(d_i, k_i) = \frac{1}{C} \cdot \frac{d_i^{\alpha}}{k_i^{\beta}}$, where $C = \sum_i \frac{d_i^{\alpha}}{k_i^{\beta}}$ is the normalizing constant (see Peyrard (2019a) for proof). Therefore the function implicitly encodes relevance and informativeness.
2.2 Entropy for Document Summarization
Earlier works use entropy for document summarization either as a method of ranking sentences or for evaluating summaries. Kennedy et al. (2010) use entropy to measure the quality of a summary by calculating the amount of unique information captured in the summary. Words in the summary are mapped to concepts (topics) using Roget's Thesaurus (Jarmasz and Szpakowicz, 2004), which contains approximately one thousand concepts with a weight assigned to each concept. Normalizing the weights to obtain a probability distribution, the authors map summary words to the concepts and calculate the entropy of the summary for a quantitative assessment of its quality.
Luo et al. (2010) conjecture that sentence entropy proxies for coverage of information by the sentence. The authors consider a sentence as a vector of terms (content words) in the document and compute the probability distribution of terms, which is used for calculating the entropy of the sentence. Since high entropy of a sentence implies more coverage, the method has an inherent bias towards long sentences, favoring their inclusion in the summary.

Yadav and Sharan (2018) gauge entropy in the latent semantic space of the document by computing the probability distribution of topics in sentences and vice-versa. They use Latent Semantic Analysis (LSA) to reveal the latent semantic space of the document. The interplay of topic and sentence entropy is used for selecting summary sentences. Since LSA factor matrices are used to compute the probability distribution of topics and sentences, the authors are compelled to ignore negative terms in the factor matrices, thereby losing substantial information.
2.3 NMF for Document Summarization
We use NMF to reveal the latent semantic space of the document. The preference for NMF over LSA is motivated by the presence of non-negative terms in NMF factor matrices, ensuring enhanced interpretability (Lee and Seung, 1999). Furthermore, the presence of non-negative terms overcomes the loss of information incurred by ignoring negative terms in LSA factor matrices.
Non-negative Matrix Factorization (NMF) is a matrix decomposition method for approximating a non-negative matrix $A$ ($A \geq 0$) as $A \approx WH$ in reduced space. Here, $W$ and $H$ are non-negative factor matrices and $r \ll \min(m, n)$ is the dimensionality of the reduced space. Starting with non-negative seed values for the $W$ and $H$ matrices, the NMF algorithm iteratively improves both factor matrices to approximate $A$ by the product $WH$, such that the Frobenius norm $\lVert A - WH \rVert_F$ is minimized.
Considering document $D$ to be summarized as a sequence of $n$ sentences ($S_1, \ldots, S_n$), $D$ is represented as term-sentence matrix $A$, where columns of $A$ correspond to sentences and rows represent terms ($t_1, \ldots, t_m$). Accordingly, $A$ is an $m \times n$ matrix where element $a_{ij}$ in $A$ denotes the occurrence of term $t_i$ in sentence $S_j$.
NMF decomposition of $A$ reveals the latent semantic space of the document via two non-negative factor matrices $W$ and $H$. Here, $W$ ($m \times r$) is the term-topic (feature) matrix and $H$ ($r \times n$) is the topic-sentence (co-efficient) matrix. Columns of matrix $W$ correspond to latent topics ($T_1, \ldots, T_r$). Each element $w_{ij}$ in $W$ gives the strength of term $t_i$ in topic $T_j$, whereas element $h_{ij}$ in $H$ specifies the strength of topic $T_i$ in sentence $S_j$. Both factor matrices $W$, $H$ in latent space and the input matrix $A$ can be efficiently exploited for document summarization.
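The factorization step can be sketched with classical multiplicative updates; this is a minimal illustration, not the exact routine used in the experiments (library implementations such as scikit-learn's NMF behave similarly), and all names are illustrative.

```python
import numpy as np

def nmf(A, r, n_iter=200, eps=1e-9, seed=0):
    """Approximate non-negative A (m x n) as W (m x r) @ H (r x n)
    by minimising the Frobenius norm via multiplicative updates."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, r)) + eps   # term-topic (feature) matrix
    H = rng.random((r, n)) + eps   # topic-sentence (co-efficient) matrix
    for _ in range(n_iter):
        H *= (W.T @ A) / (W.T @ W @ H + eps)   # improve H, keeping it non-negative
        W *= (A @ H.T) / (W @ H @ H.T + eps)   # improve W, keeping it non-negative
    return W, H
```

Columns of the returned `W` give term strengths per latent topic, and rows of `H` give topic strengths across sentences, as described above.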
Lee et al. (2009) propose NMF based unsupervised extractive document summarization method. The authors score the sentences by computing Generic Relevance Score (GRS) for each sentence. More recently, Khurana and Bhatnagar (2019) propose NMF based methods for extractive document summarization. The methods use term-oriented and topic-oriented approach for scoring sentences in the document.
3 Interpreting NMF Factor Matrices
Every well written document has a theme, and sentences in the document carry information about topics within the scope of the theme. It is, therefore, reasonable to posit that sentences and topics are two prime carriers of information (semantic units) contained in the text document. Extractive summary aims to extract informative sentences that communicate important topics in the document.
Information in a document is woven into sentences by terms (content words) glued together by stop-words (non-content words). Further, since the information in the document is embodied by both sentences and topics, both sentence and topic entropy are potent gauges of the informativeness of the document. Naturally, the two are intricately related by terms, which are the common foundational semantic units. In this and the next section, we delve into the computation, interpretation, and interplay of sentence and topic entropy for generating informative summaries.
Terms, sentences, and topics (in latent space), the three semantic units in a document, are inter-related, as shown in Fig. 1. Arrows denote the relation "consists-of", i.e., sentences consist of terms, as do topics (in latent space). Sentences consist of topics (in latent space), while topics consist of sentences. Sentences and topics are intricately intertwined, as sentences discuss topics and topics are spread over sentences. Hence, the two have a symbiotic relationship.
Non-negative Matrix Factorization (NMF) of term-sentence matrix () of a document reveals topics in the latent space through the feature matrix () and co-efficient matrix (). Quantitative relationships between the three semantic units of the document is encoded by and matrices (Fig. 1). Matrix quantifies contribution of terms in latent topics, while quantifies a linear interdependent relationship between topics and sentences (Sec. 2.3).
Fig. 2 illustrates the relationships between terms, sentences, and topics for an example document from the DUC2002 data-set (Document No. AP881118-0104; the data-set is described in Sec. 6), shown in Fig. 2(a). Application of the community detection algorithm (Sec. 6.2) reveals four latent topics in the document, which are exposed by the non-negative matrix factorization method. Fig. 2(b) shows the terms contributing to each latent topic in matrix $W$. Note that some terms find mention in multiple topics, implying their semantic contribution to these topics. For example, the term "Hirohito" occurs in seven out of twelve sentences, contributing to three topics with different strengths.
Semantic contribution of sentences to topics and vice-versa is revealed in Figs. 2(c) and 2(d) respectively. Fig. 2(c) shows the sentences contributing to each topic. The intensity of contribution of a sentence to a topic is mentioned in parentheses along with the sentence. For example, sentences , , and contribute to with strengths , , , and respectively. Sentence has the highest contribution in compared to other sentences. Fig. 2(d) shows sentences consisting of topics. The strength of a topic in a sentence is mentioned in parentheses along with the contributing topic. For example, sentence describes (consists-of) and , with the latter having higher strength.
4 Realizing Sentence and Topic Entropy
Shannon entropy quantifies the expected information conveyed by a random variable with a specified probability distribution. It furnishes an intuitive understanding of the amount of uncertainty about an event associated with a given probability distribution. Hence it is a property of the probability distribution of the event, with a higher value of entropy implying higher information content.
In order to compute sentence and topic entropy, it is imperative to transform sentences and topics into their respective probabilistic versions. Since a sentence is simultaneously composed of terms and of topics in the latent space, the probability distribution of a sentence can be realized in two ways. Likewise, a topic is conjointly composed of terms and sentences, and its probability distribution can be realized using either terms or sentences (refer to Fig. 1).
4.1 Computing Sentence Entropy
Entropy of a sentence quantifies the amount of information conveyed by the sentence. Sentences in a document are perceptible in term space (matrix $A$) as well as in latent topic space (matrix $H$). Therefore, we compute sentence entropy in two ways - (i) in term space, using the probability distribution of terms in sentences, and (ii) in the space of latent topics, using the probability distribution of topics in sentences.
Sentence Entropy in term space
Each column of term-sentence matrix $A$ is the term vector for the corresponding sentence in the document. The probability $p_{ij}$ of term $t_i$ in sentence $S_j$ is computed as $p_{ij} = a_{ij} / \sum_{i} a_{ij}$. Hence, the entropy of sentence $S_j$ in term space is given by $SE_{term}(S_j) = -\sum_{i} p_{ij} \log p_{ij}$.
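The computation reduces to a few lines; a minimal sketch follows, assuming `A` is the binary term-sentence matrix stored as a list of rows (names illustrative).

```python
from math import log2

def sentence_entropy_term_space(A, j):
    """Entropy of sentence j, computed from column j of the
    binary term-sentence matrix A (list of rows)."""
    column = [row[j] for row in A]
    total = sum(column)
    probs = [c / total for c in column if c > 0]   # P(term i | sentence j)
    return -sum(p * log2(p) for p in probs)
```

With a binary matrix, a sentence containing k distinct terms has entropy log2(k), so this entropy grows with sentence length, consistent with Row 1 of Table 1.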
Entropy of a sentence in term space is proportional to its length. Empirical correlation between sentence length and sentence entropy $SE_{term}$ (calculated from matrix $A$) for four data-sets is shown in Row 1 of Table 1. As conjectured by Luo et al. (2010), there is a high positive correlation between sentence length and sentence entropy in term space.
Sentence Entropy in space of latent topics
Column $H_j$ of the co-efficient matrix quantifies the contribution of sentence $S_j$ to the latent topics. Let $\hat{H}_j$ be the corresponding normalized column, where element $\hat{h}_{ij} = h_{ij} / \sum_{i} h_{ij}$ is the probability of latent topic $T_i$ in sentence $S_j$. The entropy of sentence $S_j$ in topic space is defined as $SE_{topic}(S_j) = -\sum_{i} \hat{h}_{ij} \log \hat{h}_{ij}$.
High entropy sentences contribute to multiple latent topics and intuitively have larger coverage. Row 2 of Table 1 shows a low correlation, slightly on the negative side, between sentence length and topic-space entropy $SE_{topic}$. This implies that sentence length is almost independent of entropy in latent topic space, i.e., longer sentences do not necessarily contribute to more topics. For example, the longest sentence of the document in Fig. 2(a) contributes to only one topic in the latent semantic space.
We also report the statistical correlation between $SE_{term}$ and $SE_{topic}$ in Row 3 of Table 1. Low correlation values for three data-sets reveal that sentence entropy in term space and that in latent topic space are nearly independent. The negative correlation between the two for the DUC2002 data-set can be explained as follows. Though a longer sentence has higher entropy in term space, there is no guarantee that it contributes to multiple latent topics, which possibly lowers its entropy in latent topic space. Fig. 2(a) shows that the longest sentence in the document has the highest term space entropy; however, it contributes to only one latent topic and has zero topic space entropy.
| Correlation between | DUC2001 | DUC2002 | CNN | DailyMail |
|---|---|---|---|---|
| Sentence length and $SE_{term}$ | 0.811 | 0.856 | 0.862 | 0.872 |
| Sentence length and $SE_{topic}$ | -0.126 | -0.169 | -0.046 | -0.049 |
| $SE_{term}$ and $SE_{topic}$ | 0.012 | -0.025 | 0.056 | 0.058 |
| $TE_{term}$ and $TE_{sent}$ | 0.855 | 0.840 | 0.906 | 0.884 |

Table 1: Pearson's Correlation Coefficient between Row 1: sentence length and sentence entropy calculated in term space (from term-sentence matrix $A$); Row 2: sentence length and sentence entropy calculated from co-efficient matrix $H$; Row 3: sentence entropies calculated from term-sentence matrix $A$ and co-efficient matrix $H$; Row 4: topic entropies calculated from NMF feature matrix $W$ and co-efficient matrix $H$
4.2 Computing Topic Entropy
Topics in a document are distributed over terms as well as over sentences in the latent space. The probability distribution of topics over terms and sentences yields two articulations of topic entropy - (i) in the space of terms, using feature matrix $W$, and (ii) in the space of sentences, using co-efficient matrix $H$.
Topic Entropy in term space
Column $W_j$ quantifies the contribution of terms to the corresponding latent topic $T_j$. Let $\hat{W}_j$ be the normalized column corresponding to $W_j$, where $\hat{w}_{ij} = w_{ij} / \sum_{i} w_{ij}$ is the probability of term $t_i$ in latent topic $T_j$. The entropy of topic $T_j$ in term space is defined as $TE_{term}(T_j) = -\sum_{i} \hat{w}_{ij} \log \hat{w}_{ij}$.
High entropy of a topic in term space indicates that more terms comprise the topic with nearly equal probabilities.
Topic Entropy in sentence space
Row $H_i$ of the co-efficient matrix quantifies the contribution of sentences to the latent topic $T_i$. Let $\hat{H}_i$ be the normalized row corresponding to $H_i$, where $\hat{h}_{ij} = h_{ij} / \sum_{j} h_{ij}$ is the probability of latent topic $T_i$ contributing in sentence $S_j$. The entropy of latent topic $T_i$ in sentence space is defined as $TE_{sent}(T_i) = -\sum_{j} \hat{h}_{ij} \log \hat{h}_{ij}$.
High entropy of a topic in space of sentences implies that the topic finds mention in multiple sentences.
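Both articulations of topic entropy normalize a slice of an NMF factor matrix and apply the Shannon formula; a minimal sketch, with matrices as lists of rows and all names illustrative:

```python
from math import log2

def entropy(weights):
    """Shannon entropy of a non-negative weight vector (normalised first)."""
    total = sum(weights)
    probs = [w / total for w in weights if w > 0]
    return -sum(p * log2(p) for p in probs)

def topic_entropy_sentence_space(H, i):
    """TE of topic i: distribute row i of coefficient matrix H over sentences."""
    return entropy(H[i])

def topic_entropy_term_space(W, j):
    """TE of topic j: distribute column j of feature matrix W over terms."""
    return entropy([row[j] for row in W])
```

A topic concentrated in a single sentence (or a single term) gets entropy zero; a topic spread evenly over k sentences gets log2(k).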
4.3 Interpreting Sentence and Topic Entropy
A “good” extractive summary must comprise high quality semantic units to faithfully capture relevant information content from the document. Accordingly, informative summary can be realized by including informative sentences about important topics discussed in the document.
Selecting sentences with high $SE_{term}$ favours inclusion of longer sentences in the summary. Luo et al. (2010) relate higher sentence entropy to higher coverage, thereby assuming that a summary with longer sentences covers more aspects of the document than one with shorter sentences. However, this assumption may not hold if the document is not well written. Oftentimes, long sentences have diffused information, either due to faulty construction or multiple ideas with little focus. Therefore, it is not prudent to employ $SE_{term}$ as the sole criterion for scoring sentences. Recall that $SE_{topic}$ is length agnostic, and hence sentence entropy in topic space is better suited for measuring the informativeness of a sentence.
A high value of $TE_{term}$ indicates higher information content of the latent topic in term space, which implies that the topic consists of more terms with nearly equal contributions. A high value of $TE_{sent}$ signifies that the topic finds mention in many sentences. We empirically analyze the correlation between $TE_{term}$ and $TE_{sent}$. Row 4 of Table 1 reveals a high positive correlation between the two, suggesting that either can be used interchangeably. We choose $TE_{sent}$ as it is computationally efficient because of the smaller size of $H$.
Even though high entropy signifies high informativeness, it would be naïve to presume that selecting high entropy sentences describing high entropy topics is key to high quality summary.
4.4 Selecting summary sentences
In this section, we examine the possible ways in which topic and sentence entropy (TE and SE, respectively) can be meaningfully combined to extract informative sentences from the document. Since low entropy implies less information, the combination of low topic and low sentence entropy begets the negative effects of both, thereby degrading the quality of the summary. Selecting sentences with high entropy from low entropy topics is not prudent either, since a low entropy topic is discussed in fewer sentences and is probably not important. We investigate the two promising combinations, viz. High TE & High SE and High TE & Low SE, in order to gain better insight into the interplay between topic and sentence entropy. Macro-averaged ROUGE scores of summaries generated by these combinations are presented in Table 2.
| High TE & High SE | R-1 | 36.94 | 38.65 | 29.30 |
| High TE & Low SE | R-1 | 41.04 | 44.51 | 29.33 |

Table 2: Macro-averaged ROUGE-1 (R-1) scores of summaries generated by the two entropy combinations
By definition, high entropy semantic units are more informative and hence better candidates for inclusion in the summary. A sentence with high entropy signifies high informativeness, since it discusses multiple latent topics. However, the presence of important topics in the sentence is also desirable at the same time. Algorithmically, this entails selecting the topic with high entropy and identifying the high entropy sentence participating in it (Algorithm 1). Iterating the procedure with the next highest entropy topic generates the summary of desired length. To rule out repetition of sentences, the sentence with the next highest entropy is selected in case the highest entropy sentence has already been included in the summary.
Even though the aforementioned approach appears theoretically satisfactory, empirical evaluation reveals a caveat. High entropy sentences selected from high entropy topics do not add much value to the summary, because their focus is thinly spread over topics, leading to low quality summaries. Selecting a sentence with low entropy (i.e., one focused on a topic) from a topic with high entropy is astute, since the low entropy sentence focuses the characterizing topic more sharply despite covering fewer topics (Algorithm 2).
Table 2 shows that the combination of High TE & High SE for sentence selection does not perform as well as the combination of High TE & Low SE. The significant gain in performance from including low entropy sentences in the summary is somewhat counter-intuitive. The complexity associated with interlacing topic and sentence entropy enjoins a clever combination of the information contained in topics and sentences.
5 Generating Informative Summary
In this section, we propose an unsupervised algorithm called E-Summ, which is an information theoretic method for extractive single document summarization. The method exploits the information conveyed by topic and sentence entropies in the latent semantic space of the document to identify candidate sentences. Subsequently, it selects the summary sentences by optimizing the information content while delimiting summary length. E-Summ is an explainable, language-, domain-, and collection-independent algorithm.
5.1 Sentence Scoring
Let $X$ be a random variable with $P(x)$ denoting the probability of occurrence of event $X = x$. The information associated with the event is quantified as $I(x) = -\log P(x)$ and is referred to as self-information (Goodfellow et al., 2016). This definition asserts that a less likely event is more informative than a highly likely event. Accordingly, a sentence with the highest contribution (probability) in a topic has minimum self-information in that topic. Thus, intuitively, a sentence with low self-information is more informative and hence a good representative of the topic. The E-Summ algorithm uses this principled criterion for selecting informative sentences from important topics.
Let $E_{ij}$ denote the event that sentence $S_j$ participates in latent topic $T_i$, and let $\hat{h}_{ij}$ be the probability of this event (Sec. 4.2). Using the notation $I(E_{ij}) = -\log \hat{h}_{ij}$ for self-information, topic entropy in sentence space (Eq. 4) can be rewritten as $TE_{sent}(T_i) = \sum_{j} \hat{h}_{ij} \, I(E_{ij})$.
Based on the ground rule that a sentence with low self-information is a good representative of a topic, E-Summ identifies candidates by choosing the sentence with the least self-information from each latent topic and assigns it a score equal to the sum of its sentence entropy in topic space ($SE_{topic}$) and the topic entropy in sentence space ($TE_{sent}$).
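The candidate-identification step can be sketched as follows. This is an illustrative simplification of the algorithm (in particular, repetition handling is reduced to skipping a sentence already chosen, whereas the full algorithm falls back to the next-best sentence); `H` is the coefficient matrix as a list of rows, and all names are illustrative.

```python
from math import log2

def entropy(weights):
    """Shannon entropy of a non-negative weight vector (normalised first)."""
    total = sum(weights)
    probs = [w / total for w in weights if w > 0]
    return -sum(p * log2(p) for p in probs)

def candidate_sentences(H):
    """For each latent topic (row of H, visited in decreasing topic entropy),
    pick the sentence with minimum self-information (maximum probability in
    the topic) and score it as SE_topic(sentence) + TE_sent(topic)."""
    n_topics, n_sents = len(H), len(H[0])
    se = [entropy([H[i][j] for i in range(n_topics)]) for j in range(n_sents)]
    te = [entropy(row) for row in H]
    candidates = {}
    for i in sorted(range(n_topics), key=lambda i: -te[i]):
        j = max(range(n_sents), key=lambda j: H[i][j])  # min self-information
        if j not in candidates:          # simplified repetition handling
            candidates[j] = se[j] + te[i]
    return candidates                    # sentence index -> score
```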
E-Summ selects summary sentences from the set of candidate sentences such that the total score of the selected sentences is maximized for the pre-defined summary length. For this purpose, we use the classical Knapsack optimization algorithm. Given a set of items, each associated with weight and value, the algorithm selects a subset of items such that the total value of items is maximized for a constant capacity (weight) of the knapsack.
Each candidate sentence identified by E-Summ has two associated attributes - sentence length and score. Considering sentence length as item weight, score as item value, and required summary length as the capacity, the Knapsack algorithm selects sentences from the set of candidates to maximize the total score for the required summary length. Thus the algorithm maximizes the total information conveyed by chosen summary sentences.
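A standard dynamic-programming solution to the 0/1 knapsack problem suffices for this selection step; a minimal sketch under the stated mapping (length as weight, score as value, summary word limit as capacity), with illustrative names:

```python
def knapsack_select(lengths, scores, budget):
    """0/1 knapsack: choose sentence indices maximising total score
    subject to total length <= budget (the summary word limit)."""
    n = len(lengths)
    # dp[w] = (best total score, tuple of chosen indices) for capacity w
    dp = [(0.0, ())] * (budget + 1)
    for i in range(n):
        # iterate capacities downwards so each sentence is used at most once
        for w in range(budget, lengths[i] - 1, -1):
            cand = (dp[w - lengths[i]][0] + scores[i],
                    dp[w - lengths[i]][1] + (i,))
            if cand[0] > dp[w][0]:
                dp[w] = cand
    return list(dp[budget][1])
```

The backward capacity loop is the textbook trick that prevents a sentence from being selected twice.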
Algorithm 3 presents the pseudo-code for the proposed E-Summ algorithm. We pre-process document by removing stop-words, punctuations and transform it to binary term-sentence matrix . Input parameters to E-Summ are matrix , number of latent topics , and desired summary length.
The algorithm decomposes matrix $A$ into two factor matrices $W$ and $H$ using NMF (step 1). In the interest of stability of the NMF factor matrices, we use the NNDSVD initialization proposed by Boutsidis and Gallopoulos (2008) (in our experience with NNDSVD initialization, randomization is not completely eliminated). In step 2, the method calculates the entropy of each latent topic in sentence space (Sec. 4.2). Next, the for loop in steps 5 to 9 examines latent topics in descending order of their entropy and identifies the best representative sentence (with minimum self-information) for each topic. The algorithm appends the selected sentence to the set of candidates. Subsequently, in step 11, the Knapsack optimization algorithm is applied to maximize the information content of the summary while delimiting the summary length.
In case the sentences selected by Knapsack algorithm do not complete the desired summary length, remaining candidates are considered in decreasing order of their score for inclusion in the summary (steps 12 to 16). However, when desired summary length is specified as number of sentences, application of Knapsack algorithm is omitted and top scoring candidates are selected for inclusion in summary.
The E-Summ algorithm is an unsupervised document summarization method that does not require external knowledge resources. It is generic, domain- and collection-independent, and does not use language tools at any stage. The algorithm is topic-oriented and hence is highly effective for documents that have differentiable topics.
The algorithm is efficient: its running time, excluding the non-negative matrix factorization (step 1), is linear in the number of latent topics ($r$), the number of sentences ($n$), and the desired summary length ($L$); this comprises the time to extract candidate sentences and the time to execute the knapsack algorithm using dynamic programming. Though NMF is an NP-hard problem (Vavasis, 2010), efficient polynomial-time algorithms that use local search heuristics are commonly available in libraries.
5.2 Explainability of E-Summ
Non-negativity constraint on NMF factor matrices enhances interpretability of latent space and accentuates explainability of E-Summ algorithm. Given the summary generated by the algorithm, it is possible to retrace the selection process and justify inclusion of sentences in summary. We illustrate the procedure using the document shown in Fig. 2(a), while decomposing it into four topics.
We elucidate the transparency of the proposed E-Summ method using the document of Example 1. Fig. 3(a) presents the matrix specifying the probability of participation of topics in sentences along with the self-information of sentences in the document. The top component of each element denotes the probability of participation of the topic in the sentence, and the bottom component denotes the self-information of the sentence in the topic. A zero-valued element indicates that the sentence does not contribute to the topic. The two rows below the matrix show sentence entropies and lengths respectively.
Fig. 3(b) shows the summary of the document (Fig. 2(a)), with sentences listed in the order in which they are selected by E-Summ. Fig. 3(c) shows the order of selection of candidate sentences. The E-Summ algorithm starts with the topic of highest entropy and selects its best representative sentence . Continuing the process with the remaining topics (in order), the algorithm adds sentences , , and to the candidate set. Applying the Knapsack algorithm on the candidate sentences extracts , and for inclusion in the summary by maximizing the total sentence score for the summary limit of words (Fig. 2(d)). However, the total length of sentences , and is words. Since the desired summary length is not attained, the E-Summ algorithm considers the remaining candidate sentence (only one in this case) to complete the summary length.
6 Experimental Design
In this section, we describe the experimental setup required for the performance evaluation of the proposed E-Summ algorithm. The algorithm is implemented in Python using Natural Language Toolkit (NLTK), textmining package and Scikit-learn toolkit. All experiments are carried out on an Intel(R) Core(TM) i5-8265U CPU running Windows 10 OS with 8GB RAM.
We use four well known data-sets - DUC2001, DUC2002 (http://duc.nist.gov), CNN, and DailyMail - to evaluate the quality of summaries generated by the E-Summ algorithm. The CNN and DailyMail corpora contain news articles and were originally constructed by Hermann et al. (2015) for the task of passage-based question answering, and later re-purposed for the task of document summarization. The DUC2001 and DUC2002 data-sets comprise and documents respectively. Each document in these data-sets has abstractive reference summaries of approximately words. The CNN and DailyMail data-sets are divided into training, validation, and test sets with and documents respectively. Each document in these two data-sets is accompanied by one reference summary consisting of story highlights. Following previous research (Narayan et al. (2018); Al-Sabahi et al. (2018); Zhou et al. (2018)), we extract three sentences for a CNN document summary and four sentences for a DailyMail document summary for comparative performance evaluation. Interestingly, all four data-sets consist of news articles. In order to establish the claim of domain independence of E-Summ, we also evaluate the performance of the E-Summ algorithm on the well known CL-SciSumm 2016 data-set (Jaidka et al., 2016) for scientific document summarization.
6.2 Number of Latent Topics
Since E-Summ selects representative sentences from important topics in the latent semantic space, decomposing the document into the optimal number of topics is critical to the quality of the summary produced by the algorithm. In case the number of sentences desired in the summary is known, setting the dimensionality of the latent semantic space is straightforward. For example, the number of sentences desired for CNN/DailyMail summaries is three/four, respectively. Accordingly, we set the number of latent topics ($r$) to three for the CNN data-set and four for the DailyMail data-set, based on the assumption that each sentence briefs one topic. In other scenarios, finding the optimal number of latent topics is a tricky task, because it is an unknown and complex function of the writing style adopted by the author, the desired length of the summary, and the background knowledge of the reader.
The number of topics in a document corresponds to the number of core concepts or ideas described in the document. Latent Dirichlet Allocation (LDA), a well-known generative probabilistic method for topic modeling, is unsuitable here because it requires the number of latent topics as an input parameter (Blei et al., 2003). Therefore, we propose to find the number of latent topics by employing a community detection algorithm, as described below.
A concept (topic) in a document is communicated by a group of semantically related and frequently co-occurring terms. These groups manifest as communities in the co-occurrence graph of the document, so identifying communities in this network can potentially reveal the optimal number of topics in the document. The idea of segmenting a document into latent topics using a word co-occurrence network representation of the text has been endorsed in recent works on topic modeling (Kido et al., 2016; Dang and Nguyen, 2018; Gerlach et al., 2018).
Consider the term-sentence matrix A for document D. Then C = A A^T is the term-term matrix for D, where element c_ij denotes the number of times term t_i co-occurs with term t_j in the document. Treating matrix C as the weighted adjacency matrix of the co-occurrence graph of D, application of a community detection algorithm reveals groups of terms related in the latent semantic space. Each group (community) maps to a topic discussed in the document. We use the Louvain algorithm (Blondel et al., 2008), a simple and fast community detection method based on the heuristic of modularity optimization.
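As a toy sketch of the graph construction described above (term names and sentence ids are invented; the Louvain step itself would come from a library such as networkx's `louvain_communities`):

```python
from itertools import combinations

def cooccurrence_graph(term_sentence):
    """Weighted term co-occurrence graph from a binary term-sentence
    incidence mapping {term: set of sentence ids}. The edge weight for
    (t1, t2) is the number of sentences containing both terms, i.e. the
    off-diagonal entries of C = A A^T."""
    weights = {}
    for t1, t2 in combinations(sorted(term_sentence), 2):
        w = len(term_sentence[t1] & term_sentence[t2])
        if w:
            weights[(t1, t2)] = w
    return weights

# Toy document: each term mapped to the sentences it appears in.
doc = {
    "market": {0, 1}, "stock": {0, 1}, "price": {1},
    "rain":   {2, 3}, "flood": {2, 3},
}
g = cooccurrence_graph(doc)
# ("market", "stock") co-occur in sentences 0 and 1 -> weight 2
```

Feeding this weighted graph to a Louvain implementation partitions the terms into communities; communities of size less than four are discarded as spurious, and the count of the surviving communities gives the number of latent topics.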
The number of communities and their sizes are specific to the content and writing style of the document. Longer documents often describe more ideas or topics and therefore reveal more communities compared to shorter documents. We found a healthy correlation of 0.70 between the document length and the number of communities in DUC2001 and DUC2002 (Fig. 4). The deviant behaviour of some documents in both data-sets is due to atypical content and writing style.
The size of a community suggests the extent of coverage of the corresponding topic in the document. Larger communities possibly span multiple sentences, making a strong case for inclusion in the summary. Smaller communities, on the other hand, denote ideas that are weakly expressed and limited to fewer sentences. We consider a community of size less than four to be an artifact of spurious co-occurrence and do not recognize it as a valid topic.
The above heuristic sometimes results in an inadequate number of communities to complete the summary (fewer than three; we envisage that this threshold may vary with the desired summary length). In such a situation, the number of latent topics is computed by taking into account the required summary length and the average sentence length, as follows.
To summarize the above discussion, we recommend the following methods to determine the number of latent topics, k.
1. When the summary length is given as a number of sentences, k is set to the desired summary length.
2. When the summary length is specified as a number of words, k is computed using the community detection method described above.
3. When the community detection method yields an inadequate number of communities, k is computed using Eq. 8.
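Eq. 8 itself is not reproduced in this excerpt; the sketch below is a hypothetical instantiation consistent with the surrounding text (required summary length in words divided by the average sentence length), not the paper's exact formula:

```python
import math

def num_topics_fallback(summary_len_words, sentence_lengths):
    """Fallback value of k when community detection yields too few
    communities: roughly one topic per expected summary sentence,
    i.e. the word budget over the average sentence length
    (hypothetical reading of Eq. 8)."""
    avg_len = sum(sentence_lengths) / len(sentence_lengths)
    return max(1, math.ceil(summary_len_words / avg_len))

# 100-word summary budget, sentences averaging 20 words -> 5 topics
k = num_topics_fallback(100, [18, 22, 20, 20])
```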
6.3 Competing Methods
To the best of the authors' knowledge, no published work on single document extractive summarization evaluates performance on all four data-sets. For comparison with the state of the art, we group algorithms based on the data-sets on which they are evaluated. Accordingly, for the DUC2001 data-set, COSUM (Alguliyev et al., 2019) and the method by Saini et al. (2019) are used as competing methods, and for DUC2002, CoRank+ (Fang et al., 2017) and the method by Saini et al. (2019) are used. Interestingly, no unsupervised method is evaluated on the CNN and DailyMail data-sets; we therefore compare the performance of E-Summ against supervised methods for these two data-sets. We use our earlier work on NMF-based document summarization as the baseline (Khurana and Bhatnagar, 2019). E-Summ performance on scientific and generic articles is compared with recent algorithms detailed in Section 7.6.
6.4 Evaluation Metrics
ROUGE is the de-facto metric for qualitative assessment of automatic summarization algorithms, and the ROUGE toolkit is the most commonly used software package for this purpose (Lin, 2004). ROUGE performs content-based evaluation by matching uni-grams, bi-grams and higher-order n-grams between system and reference (human-produced) summaries. It generates three metrics, viz. recall, precision, and F-measure, when evaluating a system summary against a reference summary.
Following previous studies, we compute the recall metric for evaluating DUC2001 and DUC2002 document summaries and F-measure for CNN and DailyMail document summaries. All reported ROUGE scores are macro-averaged over the data-set. For each ROUGE metric, we use three variations, viz. ROUGE-1 (R-1), ROUGE-2 (R-2) and ROUGE-L (R-L), for performance evaluation. R-1 matches overlapping uni-grams, R-2 matches bi-grams, and R-L locates the longest common subsequence between system and reference summaries.
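As a concrete illustration of the R-1 recall computation used for the DUC data-sets, here is a simplified sketch; the official ROUGE toolkit additionally supports stemming, stop-word removal and bootstrap confidence intervals:

```python
from collections import Counter

def rouge1_recall(system, reference):
    """ROUGE-1 recall: clipped unigram overlap divided by the number of
    unigrams in the reference summary (simplified sketch)."""
    sys_counts = Counter(system.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum(min(c, sys_counts[w]) for w, c in ref_counts.items())
    return overlap / sum(ref_counts.values())

score = rouge1_recall("the cat sat on the mat", "the cat lay on the mat")
# 5 of the 6 reference unigrams are matched
```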
Difference between the vocabularies of system and reference summaries underestimates the ROUGE score. Further, ROUGE does not work when reference summaries are not available. During the last two years, there has been a spurt in research on metrics for summary quality (Peyrard, 2019b; Bhandari et al., 2020b; Huang et al., 2020; Vasilyev and Bohannon, 2020; Fabbri et al., 2020; Bhandari et al., 2020a). Most of these works argue against the ROUGE metric because it fails to robustly match paraphrases, resulting in misleading scores that do not correlate well with human judgements (Zhang et al., 2019; Huang et al., 2020). Zhong et al. (2020a) argue that high semantic similarity with the source document is highly desirable for a good summary.
6.4.1 Semantic Similarity
The current churn in summary quality evaluation metrics prompts us to employ semantic similarity as an additional measure. Typically, reference summaries are produced by humans and hence are abstractive (all four data-sets commonly used for evaluation of extractive summarization systems provide abstractive reference summaries). Lin and Hovy (2002) observe low inter-human agreement of approximately in the single document summarization task and in the multi-document summarization task for the DUC2001 data-set. The authors advocate evaluation against more reference summaries for higher confidence in the quality of extractive summaries.
Consequent to the above observations, we supplement our investigation by evaluating system summaries using the semantics-based automatic evaluation measure proposed by Steinberger and Ježek (2012). This method evaluates system summaries w.r.t the original document instead of reference summaries, mitigating the challenges of (i) few reference summaries and (ii) vocabulary mismatch between system and reference summaries.
Steinberger and Ježek (2012) propose a Latent Semantic Analysis (LSA) based content evaluation measure to evaluate a summary w.r.t the original document. LSA is a Singular Value Decomposition (SVD) based technique, which reveals the latent semantic space of the document. We briefly describe the content-based LSA measure for summary evaluation below.
Given the binary term-sentence matrix A, application of SVD yields A = UΣV^T, where U is the term-topic matrix, Σ is the diagonal matrix of singular values, and V^T is the topic-sentence matrix. The diagonal elements of Σ denote the importance of topics in descending order. For computing the importance of terms in the latent space, the method computes matrix B as follows.
Each element b_ij of matrix B quantifies the contribution of term t_i weighted by the importance of latent topic j. The overall importance of t_i is computed as the length (L2 norm) of the i-th row of B, and the term vector for the complete document collects these importance values over all terms. The term vector is computed for the system summary and for the original document; the cosine of the angle between the two resulting vectors represents the semantic similarity between the system summary and the original document.
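The measure can be sketched as follows. This is a minimal numpy implementation under the stated assumptions (term importance taken as the singular-value-weighted L2 norm of the term's row), not the authors' exact code; the toy incidence matrix is invented:

```python
import numpy as np

def lsa_term_vector(A):
    """Term vector of a text from its binary term-sentence matrix A
    (terms x sentences), after Steinberger and Jezek (2012): weight the
    term-topic matrix U by the singular values and take the row-wise
    L2 norm as each term's overall importance."""
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    B = U * s                         # b_ij = u_ij * sigma_j
    return np.linalg.norm(B, axis=1)  # importance of each term

def semantic_similarity(A_doc, A_sum):
    """Cosine similarity between document and summary term vectors.
    Both matrices share the same term rows (document vocabulary);
    the summary matrix keeps only the selected sentence columns."""
    d, v = lsa_term_vector(A_doc), lsa_term_vector(A_sum)
    return float(d @ v / (np.linalg.norm(d) * np.linalg.norm(v)))

A = np.array([[1, 0, 1],   # toy term x sentence incidence matrix
              [1, 1, 0],
              [0, 1, 1]], dtype=float)
sim_self = semantic_similarity(A, A)          # identical texts -> 1.0
sim_part = semantic_similarity(A, A[:, :2])   # two-sentence "summary"
```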
We examine the efficacy of semantic similarity as a measure for automatic evaluation of summary quality. Using DUC2001 and DUC2002 data-sets, we generate E-Summ summaries of varying lengths (10%, 20%, …, 100% of the original document) for each document. Next, we compute the semantic similarity of E-Summ summaries w.r.t the original documents and plot the macro-averaged values in Fig. 5. It is clearly seen that increasing the summary length improves semantic similarity, approaching 100% when the summary length equals the document length. This experiment validates the efficacy of semantic similarity as a measure for evaluating summary quality.
7 Performance Evaluation
We report ROUGE scores of summaries, macro-averaged over all documents in the collection, along with the respective standard deviations. We present data-set-wise comparative evaluation of the E-Summ algorithm in Sections 7.1 to 7.3, followed by quantitative assessment of E-Summ summaries using semantic similarity in Section 7.4. We substantiate the claim of language independence in Section 7.5.
7.1 Performance on DUC2001 data-set
Table 3 shows the results of comparative evaluation of the E-Summ algorithm on DUC2001 data-set. To the best of our knowledge, no deep neural method for document summarization has been evaluated on the DUC2001 data-set, which limits comparison of E-Summ to unsupervised methods.
Table 3: Comparative evaluation on DUC2001 data-set (ROUGE recall).

| Category | Method | R-1 | R-2 | R-L |
|---|---|---|---|---|
| Baseline Methods | NMF-TR (2019) | 44.7 ± 0.1 | 15.9 ± 0.1 | 39.3 ± 0.1 |
| | NMF-TP (2019) | 43.7 ± 0.1 | 15.6 ± 0.1 | 38.5 ± 0.1 |
| Proposed Method | E-Summ | 45.6 ± 0.1 | 15.71 ± 0.1 | 40.22 ± 0.1 |
| Unsupervised Methods | TextRank (2004) | 43.71 | 16.63 | 38.77 |
| | Saini et al. (2019) | 50.24 | 29.24 | - |
Table 3 shows that the quality of summaries generated by the E-Summ algorithm is better than those generated by the baseline methods as well as the TextRank (Mihalcea and Tarau, 2004) and LexRank (Erkan and Radev, 2004) methods, except for slightly degraded R-2 performance. TextRank and LexRank are two classical algorithms for extractive document summarization and present strong baselines (we use the implementations available at https://github.com/miso-belica/sumy). Next, we choose two recent unsupervised methods, which coincidentally both formulate document summarization as an optimization problem. Both methods are collection- and domain-independent like E-Summ, but extensively use language tools.
The first method, COSUM, formulates document summarization as a clustering-based optimization problem and employs an adaptive differential evolution algorithm to solve it (Alguliyev et al., 2019). The method identifies topics in the document and clusters sentences using the k-means algorithm. Subsequently, it selects sentences from clusters using an objective function that maximizes coverage and diversity of the summary. The second method, proposed by Saini et al. (2019), employs a multi-objective binary differential evolution based optimization strategy to generate the document summary. The method simultaneously optimizes multiple summary attributes such as similarity of sentences with the document title, position of summary sentences in the document, sentence length, cohesion, and coverage of the summary. Both methods use a variant of the genetic approach to find a globally optimal solution.
Table 3 reveals that the recent competing methods (Saini et al., 2019; Alguliyev et al., 2019) have higher R-1 and R-2 scores than the E-Summ algorithm but do not report the R-L score. Further, both methods require high computational effort to obtain the optimal solution. The average running time per document reported by Saini et al. (2019) for this data-set is seconds (page 20 of the reference), which excludes the time required for computation of sentence similarity. Running time for the COSUM method is not reported by the authors. We note the running time per document for E-Summ as second for this data-set, averaged over five runs (the reported configuration of the machine used by Saini et al. (2019) is superior to the one used in our experiments). This includes the time required to compute the number of latent topics using the community detection method and to extract summary sentences using the Knapsack algorithm.
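The Knapsack-based extraction step mentioned above can be sketched generically as a 0/1 knapsack over sentence scores and lengths; the scores below are invented, and E-Summ's actual topic-based sentence scoring is not reproduced here:

```python
def knapsack_select(scores, lengths, budget):
    """0/1 knapsack DP: pick sentences maximising total score subject
    to a word budget (generic sketch of the extraction step)."""
    n = len(scores)
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for b in range(budget + 1):
            best[i][b] = best[i - 1][b]
            if lengths[i - 1] <= b:
                cand = best[i - 1][b - lengths[i - 1]] + scores[i - 1]
                if cand > best[i][b]:
                    best[i][b] = cand
    # Backtrack to recover the chosen sentence indices.
    chosen, b = [], budget
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            chosen.append(i - 1)
            b -= lengths[i - 1]
    return sorted(chosen), best[n][budget]

# Three candidate sentences with (score, word length); 50-word budget.
idx, total = knapsack_select([3.0, 4.0, 5.0], [20, 30, 25], budget=50)
# -> sentences 0 and 2 (45 words, total score 8.0)
```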
7.2 Performance on DUC2002 data-set
Performance evaluation of E-Summ algorithm for DUC2002 data-set is presented in Table 4. Though the proposed algorithm performs comparably to the baseline methods, it has mixed performance compared to unsupervised and deep neural methods.
Table 4: Comparative evaluation on DUC2002 data-set (ROUGE recall).

| Category | Method | R-1 | R-2 | R-L |
|---|---|---|---|---|
| Baseline Methods | NMF-TR (2019) | 49.0 ± 0.1 | 21.5 ± 0.1 | 44.1 ± 0.1 |
| | NMF-TP (2019) | 47.6 ± 0.1 | 19.7 ± 0.1 | 42.4 ± 0.1 |
| Proposed Method | E-Summ | 50.39 ± 0.1 | 21.16 ± 0.1 | 45.19 ± 0.1 |
| Unsupervised Methods | TextRank (2004) | 48.33 | 22.54 | 43.75 |
| | Saini et al. (2019) | 51.66 | 28.85 | - |
| Deep Neural Methods | NN-SE (2016) | 47.4 | 23.0 | 43.5 |
| | SummaRuNNer (2017) | 46.6 ± 0.8 | 23.1 ± 0.9 | 43.03 ± 0.8 |
We observe that the E-Summ algorithm performs better than both TextRank and LexRank, except for the R-2 performance of TextRank. CoRank+, which is language independent like E-Summ and uses a graph-based approach, performs better than the proposed algorithm. CoRank+ augments sentence-sentence relationships with word-sentence relationships in the document, which proves to be an advantageous strategy (Fang et al., 2017). The R-1 score of E-Summ is better than that of the COSUM method (Alguliyev et al., 2019), while its R-2 score is lower. The algorithm proposed by Saini et al. (2019) has better R-1 and R-2 performance than E-Summ. Though both COSUM and the method by Saini et al. (2019) perform better than the proposed method, their edge over E-Summ is unclear in the absence of evaluation on other data-sets and the R-L metric.
CoRank+ reports the average running time per document for DUC2002 data-set as seconds (Sec. 4.3 of Fang et al. (2017)). The average running time per document reported by Saini et al. (2019) for this data-set is seconds (page 20 of the reference), excluding the time to compute sentence similarity. E-Summ, on average, summarizes a DUC2002 document in second, including the time to compute the number of latent topics and to extract summary sentences using the Knapsack algorithm. The reported execution time is averaged over five runs.
It is also evident from Table 4 that between HSSAS (Al-Sabahi et al., 2018) and E-Summ, the former is the clear winner. However, compared to the other three methods (Cheng and Lapata, 2016; Nallapati et al., 2017; Yao et al., 2018), E-Summ has mixed performance. Considering that all four deep neural methods are trained and tuned on the CNN/DailyMail data-set, their cross-collection performance is remarkable.
7.3 Performance on CNN and DailyMail data-sets
Table 5 presents the results of comparative evaluation of the E-Summ algorithm on the combined CNN and DailyMail data-sets. Performance of E-Summ is better than the baseline NMF-TP method but lower than that of the NMF-TR method. The reason is that E-Summ, being a topic-oriented method, picks sentences from the most informative topics, while CNN and DailyMail documents, being news articles, do not have clearly demarcated topics. Consequently, E-Summ suffers a disadvantage compared to NMF-TR, which, being term-oriented (Sec. 7 of Khurana and Bhatnagar (2019)), is able to capture better summary sentences.
Table 5: Comparative evaluation on CNN and DailyMail data-sets (ROUGE F-measure).

| Category | Method | R-1 | R-2 | R-L |
|---|---|---|---|---|
| Baseline Methods | NMF-TR (2019) | 34.2 ± 0.1 | 13.2 ± 0.1 | 31.0 ± 0.1 |
| | NMF-TP (2019) | 30.4 ± 0.1 | 10.9 ± 0.1 | 27.4 ± 0.1 |
| Proposed Method | E-Summ | 30.97 ± 0.1 | 10.94 ± 0.1 | 27.78 ± 0.1 |
| Unsupervised Methods | TextRank (2004) | 31.88 | 11.80 | 28.74 |
| Deep Neural Methods | NEUSUM (2018) | 41.59 | 19.01 | 37.98 |
In the absence of evaluation of recent unsupervised extractive single document summarization methods on the CNN and DailyMail data-sets, we compare the performance of E-Summ with the TextRank and LexRank algorithms. We also choose recent neural summarization algorithms for comparison: NEUSUM (Zhou et al., 2018), REFRESH (Narayan et al., 2018), HSSAS (Al-Sabahi et al., 2018), DQN (Yao et al., 2018), BANDITSUM (Dong et al., 2018), SemSim (Yoon et al., 2020) and GSum (Dou et al., 2021).
Table 5 reveals that there is a marginal quality gap between E-Summ summaries and those of TextRank and LexRank. As expected, all deep neural methods demonstrate significantly better performance than E-Summ. We recognize that advances in deep neural methods have been a powerful driver of NLP research in recent years and have particularly benefited automatic text summarization on the benchmark data-sets. We discuss the insights gained from this experiment in Sec. 7.4 and Sec. 8. We report the per-document running time of E-Summ as second, averaged over five runs. Recall that the required summary length for these data-sets is given as a number of sentences; consequently, the community detection and Knapsack algorithms are omitted and E-Summ executes expeditiously.
7.4 Semantic Similarity with Original Document
In this section, we assess the quality of E-Summ summaries using content-based semantic similarity method described in Sec. 6.4.1.
Fig. 6 shows comparative distributions of similarity scores for system and reference summaries. The plots clearly reveal that, for all data-sets, the minimum, maximum, average, first and third quartile scores for reference summaries are lower than those for system summaries. It is therefore reasonable to conclude that E-Summ summaries carry higher semantic similarity w.r.t the complete document than the reference summaries for all four data-sets.
(Table 6: semantic similarity scores of extractive and abstractive methods.)
We also compare the semantic similarity of E-Summ summaries with three unsupervised methods and an equal number of neural methods. The unsupervised methods TextRank and LexRank offer strong baselines, as evident from the ROUGE results presented earlier (Tables 3 - 5). LSARank (Steinberger et al., 2004) is included for comparison because of its topic modelling based approach to summarization, which is similar to the approach followed by E-Summ. Three neural methods - BANDITSUM, NEUSUM, and REFRESH - which exhibit high ROUGE scores (Table 5), are included for comparison because of the availability of their CNN and DailyMail summaries at https://github.com/Yale-LILY/SummEval (Fabbri et al., 2020).
It is evident from Table 6 that semantic similarity scores of E-Summ summaries are highest among extractive methods for DUC data-sets. E-Summ also beats all the methods for CNN and DailyMail data-sets. Evidently, E-Summ summaries capture more representative textual information than those generated by the selected competing methods.
7.5 Language Independence of E-Summ
In this section, we substantiate the claim of language independence of E-Summ using documents in two Indian languages (Hindi and Marathi) and three European languages (German, Spanish and French). All documents, except the Hindi ones, are sourced from the Multiling data-sets (http://multiling.iit.demokritos.gr/pages/view/1571/datasets). Multiling is a community-driven initiative to promote NLP research in languages other than English (Kubina et al., 2013; Giannakopoulos et al., 2015, 2017b).
India is a country with a diverse set of official languages (https://www.mha.gov.in/sites/default/files/EighthSchedule_19052017.pdf). Most of these languages have limited language resources for performing common NLP tasks, and developing technologies for them is a thrust area for the Government of India. We choose Hindi for investigating the language independence of E-Summ because it is the most-spoken language in India (https://www.censusindia.gov.in/2011Census/C-16_25062018_NEW.pdf, page 6) and the official language alongside English (http://www.mea.gov.in/Images/pdf1/Part17.pdf). However, to the best of the authors' knowledge, no Hindi benchmark data-set is available for single document summarization. We also experiment with Marathi documents; Marathi is the official and widely spoken language of Maharashtra, a state in south-western India, and Marathi documents for single document summarization are available in the Multiling 2017 data-set.
We employ Google Translate to scrutinize the capability of E-Summ for Hindi documents and present detailed results for two documents. We first convert the English document to Hindi, sentence-by-sentence, using Google Translate. Next, we tokenize sentences and filter punctuation and stop-words (using the Hindi stop-words list from https://github.com/taranjeet/hindi-tokenizer/blob/master/stopwords.txt; Kunchukuttan, 2020) to create a binary term-sentence incidence matrix of the translated document. We generate the E-Summ summary of the translated document and translate the Hindi summary back to English for evaluation using the standard ROUGE toolkit and the semantic similarity measure. Fig. 7 describes the pipeline to assess the performance of the E-Summ algorithm for Hindi documents. This somewhat quirky sequence of steps allows summarization of an English document in any language using the E-Summ algorithm and quantitative assessment of the summary quality. Admittedly, the noise introduced by two back-to-back machine translations is expected to degrade the resulting quality scores.
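The language-independent preprocessing step (tokenize, drop punctuation and stop-words, build the binary term-sentence incidence matrix) can be sketched as below; the tiny stop-word list and English sentences are toy stand-ins for the published Hindi resources:

```python
import string

# Hypothetical mini stop-word list; the paper uses published
# per-language lists (Hindi, Marathi, etc.).
STOPWORDS = {"the", "a", "is", "of"}

def incidence_matrix(sentences, stopwords=STOPWORDS):
    """Binary term-sentence incidence matrix after punctuation and
    stop-word filtering: all E-Summ needs per language is a tokenizer
    and a stop-word list."""
    table = str.maketrans("", "", string.punctuation)
    toks = [[w for w in s.translate(table).lower().split()
             if w not in stopwords]
            for s in sentences]
    vocab = sorted({w for sent in toks for w in sent})
    A = [[1 if t in sent else 0 for sent in toks] for t in vocab]
    return vocab, A

vocab, A = incidence_matrix(["The cat sat.", "A cat is grey."])
# vocab -> ["cat", "grey", "sat"]; A has one row per term
```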
We select two documents from DUC2002 data-set (Document FBIS4-42178 and AP890606-0116), each with two reference summaries. The documents were selected because of their diverse R-L scores ( and , respectively). Table 7 presents the result of the experiment. Rows 1 - 3 show the lengths of the generated summaries, Row 4 specifies the lengths of the two reference summaries. Rows 5 and 6 show ROUGE scores of the Hindi summaries translated to English and E-Summ summaries of the original documents, respectively. Rows 7 - 9 show semantic similarity with original document for Hindi translated to English, E-Summ English, and reference summaries.
Table 7: Assessment of E-Summ Hindi summaries for two DUC2002 documents.

| Row | Summary | FBIS4-42178 | AP890606-0116 |
|---|---|---|---|
| | Summary length (in words) | | |
| 1 | E-Summ Hindi summary | 127 | 104 |
| 2 | Translated English summary | 96 | 90 |
| 3 | E-Summ English summary | 123 | 116 |
| | ROUGE (R-1 / R-2 / R-L) | | |
| 5 | Translated English summary | 72.28 / 39.0 / 71.29 | 21.21 / 3.06 / 18.18 |
| 6 | E-Summ English summary | 85.15 / 58.0 / 83.17 | 31.31 / 6.12 / 29.29 |
| | Semantic similarity with document (in %) | | |
| 7 | Translated English summary | 73.87 | 51.22 |
| 8 | E-Summ English summary | 93.99 | 63.65 |
As expected, the quality of the translated Hindi summaries (the original documents and summaries are presented in Appendix A.1) for the two documents is lower than that of the corresponding English summaries on both semantic similarity and ROUGE score. Two successive machine translations alter the vocabulary, resulting in lower lexical overlap with reference summaries and diminished semantic similarity with the original document. Reduction in summary lengths due to translation (Rows 1 and 2) further degrades the performance compared to E-Summ English summaries. These experiments assert the competence of the E-Summ algorithm to summarize a non-English document with reasonable quality scores.
Multiling 2017 data-set (Giannakopoulos et al., 2017a) contains thirty Marathi documents with corresponding summary lengths. We pre-process these documents by tokenizing sentences (using the sentence and word tokenizers from https://github.com/anoopkunchukuttan/indic_nlp_library), filtering punctuation and removing stop-words (using the Marathi stop-words list from https://github.com/stopwords-iso/stopwords-mr/blob/master/stopwords-mr.txt) to create the binary term-sentence incidence matrix for each document, and apply the E-Summ algorithm. Since no gold standard summaries are available, we assess summary quality by computing the semantic similarity with the original text. The average semantic similarity of E-Summ summaries with the original document is found to be . E-Summ summaries of the Marathi documents with the best and the least performance are presented in Appendix A.2.
(Table 8 fragment: the Multiling 2013 best system scores 34.00 (ROUGE-1) and 4.00 (ROUGE-2); other entries not reported.)
We further experiment with three European languages - German, Spanish and French to scrutinize the language-independence feature of E-Summ algorithm. We consider the test set documents in Multiling 2013 data-set for the three languages, each having thirty documents along with the gold standard reference summaries. We compare the ROUGE and semantic similarity of E-Summ summaries with those of TextRank and Lead-3, both of which offer strong baseline performances. Following the ROUGE evaluation criteria used in Multiling 2013 shared task (HSS scoring), we truncate the length of system summaries to the length of gold standard reference summaries. Of the three European languages used for our experiments, the participant systems of Multiling 2013 report results only for German language documents. Accordingly, we compare the performance of German test set documents with the winning ROUGE scores of Multiling 2013 (Kubina et al., 2013). Due to the non-availability of results and summaries for Spanish and French documents, we compare the ROUGE performance with TextRank and Lead-3.
Table 8 presents the comparative performance evaluation of E-Summ algorithm for the three languages. We observe that E-Summ exhibits weak performance compared to the winners of Multiling 2013 for German (System AIC for ROUGE-1 in Fig. 1 and System MD1 for ROUGE-2 in Fig. 2 of Kubina et al. (2013)). However, ROUGE performance of E-Summ is better than TextRank and Lead-3 algorithms for the three languages, except for slight degradation in ROUGE-2 score for German language documents. We also note higher values of semantic similarity scores of E-Summ summaries compared to TextRank and Lead-3 summaries.
7.6 Domain Independence of E-Summ
We summarize science articles and generic documents to substantiate the claim of domain independence of E-Summ algorithm.
We use articles from CL-SciSumm Shared Task data-sets 2016 - 2020 (Jaidka et al., 2016, 2017, 2019; Chandrasekaran et al., 2019, 2020), each split into training, development, and test sets. Each article is accompanied by three types of summaries - (i) abstract, written by the author of the paper, (ii) community summary, created using citation spans of the paper, and (iii) human-written summaries by the annotators. We evaluate the performance of E-Summ on the test sets w.r.t human written reference summaries of these documents. Based on the earlier empirical observation that Abstract, Introduction, and Conclusion sections of the scientific articles are germane for summary (Kavila and Radhika, 2015; Cachola et al., 2020), we selectively target these sections for generic, concise, and informative sentences to be included in summaries.
Columns 2 and 3 in Table 9 present 2-F and SU4-F ROUGE scores of E-Summ summaries against the published scores of the best algorithm for each year. We report SOTA performance (Yasunaga et al., 2019) on the CL-SciSumm 2016 test set. Despite the absence of training and extraneous knowledge, E-Summ performs better than the base model trained over thirty documents. However, the performance of the model trained over 1000 documents is significantly superior, confirming the importance of large training sets for high quality performance of neural summarization methods.
| Method | 2-F | SU4-F |
|---|---|---|
| Yasunaga et al. (2019)* | 18.46 | 12.21 |
| Yasunaga et al. (2019)** | 31.54 | 24.36 |
| CIST (Huang, 2017; Jaidka et al., 2017) | 27.50 | 17.80 |
| LaSTUS/TALN+INCO (Bravo et al., 2018; Jaidka et al., 2019) | 28.80 | 24.00 |
| CIST (Li et al., 2019; Chandrasekaran et al., 2019) | 27.80 | 20.00 |
| AUTH (Chandrasekaran et al., 2020) | 22.00 | - |

Table 9: Performance comparison of E-Summ for CL-SciSumm 2016-2020 test sets. 2-F: F-score ROUGE-2; SU4-F: F-score ROUGE-SU4. Best performance is shown in boldface. '-': score not reported; *: base model trained on 30 documents; **: best model trained on 1000 documents. The test sets for 2018-2020 are the same.
For the CL-SciSumm 2017 data-set, the performance of E-Summ is lower than that of the best scoring system, proposed by Huang (2017). The winning method employs statistical features, estimates feature weights, and ensures non-redundancy using Determinantal Point Process (DPP) based sentence sampling to select summary sentences. The test sets for CL-SciSumm 2018 - 2020 are the same, and hence there is one row for E-Summ performance. The winning system for CL-SciSumm 2018 (Bravo et al., 2018) beats E-Summ; it employs a convolutional neural network to learn relations between context-based document features and uses a likelihood-based scoring function. The dip in the performance of the winning systems for CL-SciSumm 2019 and CL-SciSumm 2020 is unexpected. Li et al. (2019) employ a statistical feature model and a neural language model to extract summary sentences using DPP sampling. Gidiotis et al. (2020) follow an abstractive summarization technique using the PEGASUS model pre-trained on the arXiv data-set and generate the summary of a scientific article based on its abstract and cited text spans.
Admittedly, summaries generated by the best models proposed by Huang (2017), Bravo et al. (2018), Li et al. (2019), and Yasunaga et al. (2019) are significantly better than those produced by E-Summ. However, frugality in terms of human-curated knowledge makes E-Summ particularly attractive for summarizing new research articles with little or no citation information available.
WikiHow is a large scale data-set consisting of article-summary pairs prepared from an online knowledge base, written by different human authors and covering a wide range of topics with diverse writing styles (Koupaee and Wang, 2018). The data-set is divided into training, validation and testing sets consisting of documents respectively. Following previous works (Zhong et al., 2020b; Dou et al., 2021), we extract four sentences for E-Summ summaries of WikiHow documents.
Table 10: Comparative evaluation on WikiHow data-set.

| Category | Method | R-1 | R-2 | R-L |
|---|---|---|---|---|
| Extractive | MatchSum (Zhong et al., 2020b) | 31.85 | 8.98 | 29.58 |
| | CUPS (Desai et al., 2020) | 30.94 | 9.06 | 28.81 |
| | LFIP-SUM (Jang and Kang, 2021) | 24.28 | 5.32 | 18.69 |
| Abstractive | BART (Lewis et al., 2019) | 41.46 | 17.80 | 39.89 |
| | GSum (Dou et al., 2021) | 41.74 | 17.73 | 40.09 |
Table 10 presents comparative evaluation of E-Summ with recent methods evaluated on the WikiHow data-set. BART and GSum exhibit SOTA performance on this data-set; BERT-based extractive neural summarization methods (Zhong et al., 2020b; Desai et al., 2020) lose to them by a wide margin. LFIP-SUM and E-Summ have comparable performance, which is clearly weak. LFIP-SUM is an unsupervised extractive summarization method that formulates summarization as an integer linear programming problem over pre-trained sentence embeddings and uses Principal Component Analysis for sentence importance and extraction.
Since WikiHow summaries are highly abstractive, with an average compression ratio of (Koupaee and Wang, 2018), abstractive summarization methods are expected to score much better than extractive methods. The high degree of abstraction and lexical variability in WikiHow gold standard reference summaries inevitably lowers the ROUGE performance of extractive summarization methods, as revealed by our investigation.
8 State-of-the-art in Document Summarization
The vast majority of extractive summarization algorithms use four data-sets for performance evaluation, exhibiting an interesting pattern. Unsupervised summarization methods mostly evaluate performance on the DUC data-sets (Fang et al., 2017; Alguliyev et al., 2019; Saini et al., 2019, among other recent works), while deep neural summarization methods use the CNN and DailyMail data-sets. However, some neural methods train on the CNN and DailyMail data-sets and test on DUC2002, treating it as an out-of-domain data-set (Cheng and Lapata, 2016; Nallapati et al., 2017; Zhou et al., 2018; Dong et al., 2018; Al-Sabahi et al., 2018). The recently added NEWSROOM data-set (Grusky et al., 2018), consisting of news articles and corresponding human-written reference summaries, is gradually gaining popularity in the summarization research community.
Table 11 lists performance scores of E-Summ algorithm and the winner for each data-set, thereby recording the gap between E-Summ and best performance in terms of ROUGE scores. Entries marked ‘-’ indicate missing evaluation of the algorithm, and ‘*’ indicates that the algorithm is not the top-scorer for the corresponding data-set.
|Method||DUC2001 (R-1, R-2, R-L)||DUC2002 (R-1, R-2, R-L)||CNN & DailyMail (R-1, R-2, R-L)|
|Saini et al. (2019)||50.24, 29.24, -||*, 28.85, -||-, -, -|
GSum (Dou et al., 2021) delivers state-of-the-art performance (all three ROUGE scores) for one data-set. The method uses pre-trained BART with additional guidance as input to control the output. Notably, the selective presentation of evaluation results across data-sets makes fair comparison of algorithms difficult. Under the present circumstances, a method can at best be designated state-of-the-art for a specific data-set. It is reasonable to conclude that the state-of-the-art for generic extractive summarization is yet to be achieved.
The E-Summ algorithm, which is well grounded in information theory, generates summaries with high semantic similarity to the source, even though it attains relatively lower lexical overlap as measured by the ROUGE metric. We believe that information-theoretic methods like E-Summ have high potential to evolve and occupy space in the bouquet of summarization algorithms.
Deep Neural vs. Non-Neural Methods
Experiments reported in Sec. 7 show that ROUGE scores of summaries generated by deep neural methods are generally higher for all data-sets, suggesting that these methods hold more promise than their unsupervised counterparts.
Understandably, this advantage comes at the price of long model-training and training-data preparation time for neural methods. Dependence on language models, domain and collection dependence, and lack of transparency and interpretability are ancillary costs of these methods. Furthermore, the high ROUGE scores exhibited by recent neural methods for specific data-sets hint at weak summarization abilities of these methods, akin to weak AI.
Though neural methods easily beat the E-Summ algorithm, E-Summ establishes the promise of the information-theoretic approach for unsupervised extractive document summarization. Vanilla E-Summ can be bolstered by effective sentence selection methods that filter out distracting topics and reveal the semantically most relevant sentences. Interesting ideas that can fortify E-Summ include DPP for sentence selection (Huang, 2017; Li et al., 2019), graph-based methods for sentence selection (Zheng and Lapata, 2019; Gupta et al., 2019), multi-objective optimization for sentence selection (Saini et al., 2019; Mishra et al., 2021), BART for abstraction (Chaturvedi et al., 2020; Dou et al., 2021), and embedding-based similarity for reducing redundancy (Hailu et al., 2020; Zhong et al., 2020b).
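The last of these ideas can be sketched with a minimal greedy filter that drops near-duplicate sentences using embedding similarity. The toy vectors, scores, and similarity threshold below are assumptions for demonstration only, not part of E-Summ or the cited methods.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def select_nonredundant(embeddings, scores, budget=2, max_sim=0.8):
    """Greedily pick high-scoring sentences, skipping any whose cosine
    similarity to an already selected sentence exceeds max_sim."""
    order = np.argsort(scores)[::-1]      # best-scoring sentences first
    chosen = []
    for i in order:
        if len(chosen) == budget:
            break
        if all(cosine(embeddings[i], embeddings[j]) < max_sim for j in chosen):
            chosen.append(int(i))
    return chosen

emb = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]])  # sentences 0 and 1 are near-duplicates
scores = np.array([0.9, 0.8, 0.5])
picked = select_nonredundant(emb, scores)              # the near-duplicate is skipped
```

Here sentence 1 is rejected because it nearly duplicates the already selected sentence 0, so the diverse sentence 2 is picked instead.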
Comparison of Document Summarization Applications
Applications of text summarization have been growing steadily and will continue to do so in the foreseeable future. There is considerable demand for summarization tools across diverse domains and genres, and even more so in the context of purpose (Kanapala et al., 2019). Existing summarization methods take cognizance of neither genre nor purpose, thereby missing salient cues exposed by structure, writing style, vocabulary, etc. They are also oblivious to the requirement of different types of summaries based on evolving user needs (Lloret and Palomar, 2012).
Most existing deep neural summarization methods are trained on documents belonging to the genre “news” and their performance over scientific articles, literary documents, blogs and web pages, reports, letters and memos, etc. is yet to be examined. However, once trained appropriately, deep neural methods hold high promise for high precision summarization of scholarly documents in science, technology, social science, law and international relations, etc. Unsupervised summarization methods like COSUM, CoRank+, E-Summ etc. are independent of collection, with the design inspired by intuitive ideas for summarization by humans, and backed by sound computational techniques. Consequently, they are expected to exhibit more predictable performance across genres.
On-the-fly summarization of web documents is one of the most desired text analytics tasks of the current era. With deeper penetration of digital services, improving literacy rates, and advancing language technology, more people are accessing the internet to stay connected via online social networks, communicate in office and personal domains, satisfy their knowledge needs, etc. Summarization integrated with browsers may become as commonplace as language translation tools over time. To meet this requirement, it is pragmatic to develop language-agnostic, lean, and fast methods capable of summarizing generic documents. Recently, Dhaliwal et al. (2021) proposed an on-device model that employs a character-level neural architecture for extractive text summarization. So far, real-time on-device summarization is uncharted territory, and we envisage rapid developments in this direction.
A summary is informative for a user if it adds to her personal knowledge. Peyrard (2019a) claims that once the reader’s background knowledge is modeled, it can be blended with an information-theoretic framework to generate informative personalized summaries. Generating user-centric summaries entails capturing the user’s background knowledge in terms of semantic units and identifying those that maximize relevance and minimize redundancy. Minimizing the KL-divergence between the distributions of semantic units in the summary (S) and the original document (D) addresses relevance and redundancy, while the amount of new information contained in a summary is given by the cross-entropy between the background knowledge and S. This is a promising line of research that requires addressing additional challenges in the area of human-computer interaction.
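These two quantities can be sketched over toy distributions of semantic units. The distributions below are invented for illustration, and the direction of the cross-entropy follows one reading of Peyrard's framework; neither is taken from any experiment in this paper.

```python
import math

def kl(p, q):
    """KL divergence D(p || q) between aligned discrete distributions, in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def cross_entropy(p, q):
    """Cross-entropy H(p, q) in bits; large when q assigns little mass to units of p."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

# Toy distributions over four semantic units (illustrative numbers only).
doc        = [0.4, 0.3, 0.2, 0.1]   # semantic units in the document D
summary    = [0.5, 0.3, 0.1, 0.1]   # semantic units in a candidate summary S
background = [0.7, 0.1, 0.1, 0.1]   # reader's background knowledge K

fit = kl(summary, doc)                         # low => S reflects D (relevance, non-redundancy)
new_info = cross_entropy(summary, background)  # high => S covers units unfamiliar to the reader
```

A personalized summarizer would search for an S with small `fit` (faithful to D) and large `new_info` (novel with respect to K).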
Limitations of E-Summ
E-Summ is an efficient algorithm that generates extractive summaries without any dependence on extraneous knowledge and with no training, tuning, or feature selection overheads. Besides, the method holds the promise of domain and language independence.
|Document Statistics||Execution Time|
|Size (KB)||A (m × n)||#topics||NMF||Knapsack||Total|
Intuitively, a long document generates a high-dimensional term-sentence matrix and a larger number of latent topics, raising the memory requirement and computational expense of NMF. To investigate the scalability of E-Summ, we select twelve documents of different lengths from the DUC2002 data-set and note the size of the term-sentence matrix, the number of topics, and the break-up of time for factorization and sentence selection (Table 12). As expected, the dimensions of the term-sentence matrix and the number of latent topics increase with the size of the document. However, the computational expense of NMF does not increase significantly with the size of the input matrix, thanks to efficient NMF solvers. Since the summary size is small and the same (100 words) for all documents, the execution time of the Knapsack algorithm is negligible and nearly identical across documents. However, the E-Summ algorithm suffers from the following two limitations.
Long Summaries: E-Summ is not suitable for generating long summaries. Extracting a long summary requires solving a large knapsack instance, resulting in a high execution cost for the Knapsack algorithm. We generate summaries of varying lengths for the longest document (LA011990-0091, 3181 words) in the DUC2002 data-set and observe the execution time of the Knapsack algorithm. Table 13 reveals that if long summaries are required, then sentence selection by the Knapsack algorithm is the bottleneck.
|L (words)||100||200||300||400||500||600||700|
|Knapsack||5.216E-04||1.091E-03||1.309E-03||1.268E-03||1.637E-01||4.050E+00||1.585E+02|
|Total||3.399E-01||3.509E-01||3.456E-01||3.469E-01||6.860E-01||4.680E+00||1.592E+02|
Table 13: Execution time of Knapsack algorithm (in seconds) for document no. LA011990-0091 in DUC2002 data-set for different summary lengths (L).
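The dependence of sentence-selection cost on the word budget can be illustrated with the classic 0/1 knapsack dynamic program, whose table size grows in proportion to the capacity L. This is a generic textbook sketch with invented scores and sentence lengths, not the E-Summ implementation, whose observed scaling in Table 13 may stem from a different formulation.

```python
def knapsack(values, weights, capacity):
    """Classic 0/1 knapsack DP: O(n * capacity) time and O(capacity) space,
    so the cost of sentence selection grows with the word budget L."""
    best = [0] * (capacity + 1)
    for value, weight in zip(values, weights):
        for c in range(capacity, weight - 1, -1):  # reverse scan keeps items 0/1
            best[c] = max(best[c], best[c - weight] + value)
    return best[capacity]

# Sentences with (informativeness score, length in words); budget of 25 words.
values  = [60, 100, 120, 30]
weights = [10, 20, 15, 5]
best_score = knapsack(values, weights, 25)
```

With a 25-word budget, the optimal choice packs the 15-word and 10-word sentences for a total score of 180.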
Highlight Extraction: E-Summ is not suitable for generating highlights or TLDR summaries. This is evident from the weak results on the WikiHow and CNN/DailyMail data-sets, both of which have abstracted highlights of the articles. The scope of summarization methods is subtly different from that of the highlight generation task. While both are expected to have document-wide coverage, the summary of an article is expected to be smooth and coherent, whereas highlights are short sentences that provide a focused overview of the article. The two have been recognized as independent tasks with contrasting objectives, particularly in the context of scientific articles (Cagliero and La Quatra, 2020). Extractive summarization methods are not suitable for highlight generation and need to be blended with abstractive summarization models.
In this paper, we propose an information-theoretic approach for unsupervised, extractive single-document summarization. We decompose the binary term-sentence matrix of a document using non-negative matrix factorization and adopt a probabilistic perspective of the resulting factor matrices. We extract probability distributions of sentences and topics in the latent space, the principal semantic units constituting the document. We then leverage the entropy of topics and sentences to generate the document summary. The proposed E-Summ algorithm is domain- and collection-independent and is agnostic to the language of the document. Moreover, the method is explainable and fast enough to meet real-time requirements for on-the-fly summarization of web documents in languages other than English.
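The pipeline summarized above can be sketched end-to-end on a toy matrix. The snippet uses textbook multiplicative-update NMF and column normalization to obtain per-sentence topic distributions and their entropies; the actual E-Summ algorithm relies on an efficient solver, SVD-based initialization, and the selection step of Sec. 5, so this is only an illustration.

```python
import numpy as np

def nmf(A, k, iters=200, seed=0):
    """Textbook multiplicative-update NMF (Lee and Seung): A ~= W @ H, W, H >= 0."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k)) + 1e-3
    H = rng.random((k, n)) + 1e-3
    for _ in range(iters):
        H *= (W.T @ A) / (W.T @ W @ H + 1e-9)   # update topic-sentence factor
        W *= (A @ H.T) / (W @ H @ H.T + 1e-9)   # update term-topic factor
    return W, H

def entropy(p):
    """Shannon entropy (bits) of a discrete distribution."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Toy binary term-sentence matrix: 5 terms x 4 sentences.
A = np.array([[1, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 1, 1, 0],
              [0, 0, 1, 1],
              [0, 0, 0, 1]], dtype=float)
W, H = nmf(A, k=2)
# Column-normalize H: a probability distribution over topics for each sentence.
P_topic_given_sent = H / H.sum(axis=0)
sent_entropy = [entropy(P_topic_given_sent[:, j]) for j in range(A.shape[1])]
```

Sentences with low topic entropy concentrate on a single latent topic, a property an entropy-based selector can exploit when composing the summary.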
We present a comprehensive performance analysis of the proposed method on four well-known public data-sets, reporting results data-set-wise. The experiments reveal that the combination of NMF and information theory begets the advantages of speed and transparency but falls short in comparative performance measured by ROUGE score. Since all gold-standard summaries are abstractive (human-generated), ROUGE measurement for extractive summaries carries an assurance deficit. We therefore measure the semantic similarity of the E-Summ summary with the complete document and observe that it is higher than that of the reference summaries for all data-sets.
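One simple way to score such summary-document semantic similarity is cosine similarity between mean-pooled sentence vectors. The toy vectors below are invented, and this sketch is only one plausible instantiation; the measure actually used in the paper (Sec. 7.4) may differ.

```python
import numpy as np

def semantic_similarity(summary_vecs, doc_vecs):
    """Cosine similarity between mean-pooled summary and document
    sentence vectors: a simple proxy for semantic coverage."""
    s = np.asarray(summary_vecs).mean(axis=0)   # summary centroid
    d = np.asarray(doc_vecs).mean(axis=0)       # document centroid
    return float(s @ d / (np.linalg.norm(s) * np.linalg.norm(d)))

# Toy sentence vectors: the "summary" keeps 2 of the document's 4 sentences.
doc_vecs = [[1.0, 0.2], [0.9, 0.3], [0.2, 1.0], [0.1, 0.9]]
summ_vecs = [doc_vecs[0], doc_vecs[2]]
sim = semantic_similarity(summ_vecs, doc_vecs)
```

Because the two kept sentences span both semantic directions of the toy document, the similarity is close to 1.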
Encouraged by our results in Sec. 7.6, we present E-Summ summaries of 150 words for two key sections of this article in Fig. 8. (The summary length of both sections is slightly longer because the last sentence selected for the summary overshoots the summary length.) We remove mathematical equations, figures, and tables, and determine the number of latent topics using the community detection method (Sec. 6.2). Sentences in the summary are presented in the order in which they appear in the article. We leave it to the reader to judge the quality of the summaries.
We sincerely thank the anonymous reviewers for their valuable and constructive feedback, which has led to substantial improvement in the quality of the paper. We also thank the authors of the winning teams of the CL-SciSumm Shared Tasks 2018 and 2019 for sharing their summaries.
- A hierarchical structured self-attentive model for extractive document summarization (HSSAS). IEEE Access 6, pp. 24205–24212. Cited by: §6.1, §7.2, §7.3, §8.
- MCMR: maximum coverage and minimum redundant text summarization model. Expert Systems with Applications 38 (12), pp. 14514–14522. Cited by: §1.
- COSUM: text summarization based on clustering and optimization. Expert Systems 36 (1), pp. e12340. Cited by: §1, §1, §6.3, §7.1, §7.1, §7.2, §8.
- Re-evaluating evaluation in text summarization. arXiv preprint arXiv:2010.07100. Cited by: §1, §6.4.
- Metrics also disagree in the low scoring range: revisiting summarization evaluation metrics. arXiv preprint arXiv:2011.04096. Cited by: §1, §6.4.
- Latent Dirichlet allocation. Journal of Machine Learning Research 3 (Jan), pp. 993–1022. Cited by: §6.2.
- Fast unfolding of communities in large networks. Journal of statistical mechanics: theory and experiment 2008 (10), pp. P10008. Cited by: §6.2.
- SVD based initialization: a head start for nonnegative matrix factorization. Pattern Recognition 41 (4), pp. 1350–1362. Cited by: §5.1.
- Lastus/taln+ inco@ cl-scisumm 2018-using regression and convolutions for cross-document semantic linking and summarization of scholarly literature. In BIRNDL@ SIGIR, Cited by: §7.6, §7.6, Table 9.
- Tldr: extreme summarization of scientific documents. arXiv preprint arXiv:2004.15011. Cited by: §7.6.
- Extracting highlights of scientific articles: a supervised summarization approach. Expert Systems with Applications 160, pp. 113659. Cited by: item ii.
- Overview and insights from the shared tasks at scholarly document processing 2020: cl-scisumm, laysumm and longsumm. In Proceedings of the First Workshop on Scholarly Document Processing, pp. 214–224. Cited by: §7.6, Table 9.
- Overview and results: cl-scisumm shared task 2019. arXiv preprint arXiv:1907.09854. Cited by: §7.6, Table 9.
- Divide and conquer: from complexity to simplicity for lay summarization. In Proceedings of the First Workshop on Scholarly Document Processing, pp. 344–355. Cited by: §8.
- Neural summarization by extracting sentences and words. arXiv preprint arXiv:1603.07252. Cited by: §7.2, §8.
- ComModeler: topic modeling using community detection.. In EuroVA@ EuroVis, pp. 1–5. Cited by: §6.2.
- Compressive summarization with plausibility and salience modeling. arXiv preprint arXiv:2010.07886. Cited by: §7.6, Table 10.
- On-device extractive text summarization. In 2021 IEEE 15th International Conference on Semantic Computing (ICSC), Vol. , Los Alamitos, CA, USA, pp. 347–354. External Links: Cited by: §8.
- BanditSum: extractive summarization as a contextual bandit. arXiv:1809.09672. Cited by: §1, §7.3, §8.
- GSum: a general framework for guided neural abstractive summarization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Cited by: §1, §7.3, §7.6, Table 10, §8, §8.
- LexRank: graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research 22, pp. 457–479. Cited by: §7.1.
- A survey on evaluation of summarization methods. Information processing & management 56 (5), pp. 1794–1814. Cited by: §1.
- SummEval: re-evaluating summarization evaluation. arXiv preprint arXiv:2007.12626. Cited by: §1, §6.4, footnote 9.
- Word-sentence co-ranking for automatic extractive text summarization. Expert Systems with Applications 72, pp. 189–195. Cited by: §1, §6.3, §7.2, §7.2, §8.
- A network approach to topic models. Science advances 4 (7), pp. eaaq1360. Cited by: §6.2.
- Multiling 2017 overview. In Proceedings of the MultiLing 2017 workshop on summarization and summary evaluation across source types and genres, pp. 1–6. Cited by: §7.5.
- Multiling 2015: multilingual summarization of single and multi-documents, on-line fora, and call-center conversations. In Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pp. 270–274. Cited by: §7.5.
- AUTH@ clscisumm 20, laysumm 20, longsumm 20. In Proceedings of the First Workshop on Scholarly Document Processing, pp. 251–260. Cited by: §7.6.
- Deep learning. MIT press. Cited by: §5.1.
- Newsroom: a dataset of 1.3 million summaries with diverse extractive strategies. arXiv preprint arXiv:1804.11283. Cited by: §8.
- Entailment and spectral clustering based single and multiple document summarization. International Journal of Intelligent Systems and Applications 10 (4), pp. 39. Cited by: §8.
- Text summarization through entailment-based minimum vertex cover. In Proceedings of the Third Joint Conference on Lexical and Computational Semantics (* SEM 2014), pp. 75–80. Cited by: §1.
- A framework for word embedding based automatic text summarization and evaluation. Information 11 (2), pp. 78. Cited by: §8.
- Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, pp. 1693–1701. Cited by: footnote 4.
- What have we achieved on text summarization?. arXiv preprint arXiv:2010.04529. Cited by: §1, §1, §6.4.
- Cist@ clscisumm-17: multiple features based citation linkage, classification and summarization. In BIRNDL@ SIGIR (2), Cited by: §1, §7.6, §7.6, Table 9, §8.
- The cl-scisumm shared task 2017: results and key insights. In BIRNDL@SIGIR, Cited by: §7.6, Table 9.
- Overview of the cl-scisumm 2016 shared task. In Proceedings of the Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL), pp. 93–102. Cited by: §6.1, §7.6.
- The cl-scisumm shared task 2018: results and key insights. arXiv preprint arXiv:1909.00764. Cited by: §7.6, Table 9.
- Learning-free unsupervised extractive summarization model. IEEE Access 9, pp. 14358–14368. Cited by: Table 10.
- Roget’s thesaurus and semantic similarity. Recent Advances in Natural Language Processing III: Selected Papers from RANLP 2003, pp. 111. Cited by: §2.2.
- Text summarization from legal documents: a survey. Artificial Intelligence Review 51 (3), pp. 371–402. Cited by: §8.
- Extractive text summarization using modified weighing and sentence symmetric feature methods. International Journal of Modern Education and Computer Science 7 (10), pp. 33. Cited by: §7.6.
- Entropy-based sentence selection with roget’s thesaurus.. In TAC, Cited by: §2.2.
- Extractive document summarization using non-negative matrix factorization. In International Conference on Database and Expert Systems Applications, pp. 76–90. Cited by: §2.3, §6.3, §7.3.
- NMF ensembles? not for text summarization!. In Proceedings of the First Workshop on Insights from Negative Results in NLP, pp. 88–93. Cited by: §1.
- Topic modeling based on louvain method in online social networks. In Anais Principais do XII Simpósio Brasileiro de Sistemas de Informação, pp. 353–360. Cited by: §6.2.
- Wikihow: a large scale text summarization dataset. arXiv preprint arXiv:1810.09305. Cited by: §7.6, §7.6.
- Acl 2013 multiling pilot overview. In Proceedings of the MultiLing 2013 Workshop on Multilingual Multi-document Summarization, pp. 29–38. Cited by: §7.5, §7.5, §7.5.
- The IndicNLP Library. Note: https://github.com/anoopkunchukuttan/indic_nlp_library/blob/master/docs/indicnlp.pdf Cited by: footnote 14.
- Learning the parts of objects by non-negative matrix factorization. Nature 401 (6755), pp. 788. Cited by: §2.3.
- Automatic generic document summarization based on non-negative matrix factorization. Information Processing & Management 45 (1), pp. 20–34. Cited by: §2.3.
- Bart: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461. Cited by: Table 10.
- CIST@ clscisumm-19: automatic scientific paper summarization with citances and facets.. In BIRNDL@ SIGIR, pp. 196–207. Cited by: §7.6, §7.6, Table 9, §8.
- Manual and automatic evaluation of summaries. In Proceedings of the ACL-02 Workshop on Automatic Summarization-Volume 4, pp. 45–51. Cited by: §6.4.1.
- Rouge: a package for automatic evaluation of summaries. Text Summarization Branches Out. Cited by: §6.4.
- Text summarisation in progress: a literature review. Artificial Intelligence Review 37 (1), pp. 1–41. Cited by: §8.
- The automatic creation of literature abstracts. IBM Journal of research and development 2 (2), pp. 159–165. Cited by: §1.
- Effectively leveraging entropy and relevance for summarization. In Asia Information Retrieval Symposium, pp. 241–250. Cited by: §1, §2.2, item i, §4.3.
- Textrank: bringing order into text. In Proceedings of the 2004 conference on empirical methods in natural language processing, Cited by: §7.1.
- Scientific document summarization in multi-objective clustering framework. Applied Intelligence, pp. 1–24. Cited by: §8.
- SummaRuNNer: a recurrent neural network based sequence model for extractive summarization of documents. In AAAI, pp. 3075–3081. Cited by: §1, §7.2, §8.
- Classify or select: neural architectures for extractive document summarization. arXiv:1611.04244. Cited by: §1.
- Ranking sentences for extractive summarization with reinforcement learning. arXiv preprint arXiv:1802.08636. Cited by: §1, §6.1, §7.3.
- Topical coherence for graph-based extractive summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1949–1954. Cited by: §1.
- A simple theoretical model of importance for summarization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 1059–1073. Cited by: §1, §1, §1, §1, §2.1, §2.1, §2, §8.
- Studying summarization evaluation metrics in the appropriate scoring range. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 5093–5100. Cited by: §1, §6.4.
- Extractive single document summarization using binary differential evolution: optimization of different sentence quality measures. PloS one 14 (11). Cited by: §1, §6.3, §7.1, §7.1, §7.2, §7.2, Table 3, Table 4, Table 11, §8, §8, footnote 8.
- Using latent semantic analysis in text summarization and summary evaluation. Proc. ISIM 4, pp. 93–100. Cited by: §7.4.
- Evaluation measures for text summarization. Computing and Informatics 28 (2), pp. 251–275. Cited by: §6.4.1, §6.4.1.
- Is human scoring the best criteria for summary evaluation?. arXiv preprint arXiv:2012.14602. Cited by: §1, §6.4.
- On the complexity of nonnegative matrix factorization. SIAM Journal on Optimization 20 (3), pp. 1364–1377. Cited by: §5.1.
- A new lsa and entropy-based approach for automatic text document summarization. International Journal on Semantic Web and Information Systems (IJSWIS) 14 (4), pp. 1–32. Cited by: §2.2.
- Deep reinforcement learning for extractive document summarization. Neurocomputing 284, pp. 52–62. Cited by: §7.2, §7.3.
- Scisummnet: a large annotated corpus and content-impact models for scientific paper summarization with citation networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 7386–7393. Cited by: §1, §7.6, §7.6, Table 9.
- Learning by semantic similarity makes abstractive summarization better. arXiv preprint arXiv:2002.07767. Cited by: §7.3.
- Bertscore: evaluating text generation with bert. arXiv preprint arXiv:1904.09675. Cited by: §6.4.
- Sentence centrality revisited for unsupervised summarization. arXiv preprint arXiv:1906.03508. Cited by: §8.
- Extractive summarization as text matching. arXiv preprint arXiv:2004.08795. Cited by: §1, §6.4, §7.6, Table 10, §8.
- Neural document summarization by jointly learning to score and select sentences. arXiv preprint arXiv:1807.02305. Cited by: §1, §6.1, §7.3, §8.
Appendix A
A.1 Summaries of Hindi documents
We report E-Summ summaries of the two documents referred in Sec. 7.5. For each document, E-Summ Hindi summary appears first, followed by its English translation and E-Summ summary of the original English document.
A.2 Summaries of Marathi documents
We report E-Summ summaries of two Marathi language documents with best and worst semantic similarity scores with their corresponding original document.