Despite numerous recent developments in neural summarization systems Narayan et al. (2018b); Nallapati et al. (2016); See et al. (2017); Kedzie et al. (2018); Gehrmann et al. (2018); Paulus et al. (2017), the underlying rationales behind the improvements and their dependence on the training corpus remain largely unexplored. Edmundson (1969) put forth the position hypothesis: important sentences appear in preferred positions in the document. Lin and Hovy (1997) provide a method to empirically identify such positions. Later, Hong and Nenkova (2014) showed an intentional lead bias in news writing, suggesting that sentences appearing early in news articles are more important for summarization tasks. More generally, it is well known that recent state-of-the-art models Nallapati et al. (2016); See et al. (2017) are often only marginally better than the first-k baseline on single-document news summarization.
In order to address the position bias of news articles, Narayan et al. (2018a) collected a new dataset called XSum to create single sentence summaries that include material from multiple positions in the source document. Kedzie et al. (2018) showed that the position bias in news articles is not the same across other domains such as meeting minutes Carletta et al. (2005).
In addition to position, Lin and Bilmes (2012) defined other sub-aspect functions of summarization, including coverage, diversity, and information. Lin and Bilmes (2011) claim that many existing summarization systems are instances of mixtures of such sub-aspect functions; for example, maximum marginal relevance (MMR) Carbonell and Goldstein (1998) can be seen as a combination of diversity and importance functions.
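That decomposition can be made concrete with a minimal MMR scoring sketch; the `lam` trade-off weight and the toy relevance/similarity values below are illustrative assumptions, not values from any cited system:

```python
import numpy as np

def mmr_select(relevance, sim, k, lam=0.5):
    """Greedy MMR: at each step pick the sentence maximizing
    lam * importance - (1 - lam) * max-similarity-to-already-chosen."""
    chosen, candidates = [], list(range(len(relevance)))
    while candidates and len(chosen) < k:
        def score(i):
            redundancy = max((sim[i][j] for j in chosen), default=0.0)
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(candidates, key=score)
        chosen.append(best)
        candidates.remove(best)
    return chosen

relevance = [0.9, 0.85, 0.3]       # toy importance score per sentence
sim = np.array([[1.0, 0.95, 0.1],  # sentences 0 and 1 are near-duplicates
                [0.95, 1.0, 0.1],
                [0.1, 0.1, 1.0]])
print(mmr_select(relevance, sim, k=2))  # picks 0, then the diverse 2 over the redundant 1
```

With `lam=1.0` the same routine degenerates to pure importance ranking, which is what makes MMR readable as a mixture of the two sub-aspect functions.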
Following the sub-aspect theory, we explore three important aspects of summarization (§3): position for choosing sentences by their position, importance for choosing relevant contents, and diversity for ensuring minimal redundancy between summary sentences.
We then conduct an in-depth analysis of these aspects over nine different domains of summarization corpora (§5), including news articles, meeting minutes, books, movie scripts, academic papers, and personal posts. For each corpus, we investigate which aspects are most important and develop a notion of corpus bias (§6). We then provide an empirical study of which sub-aspect factors current summarization systems are composed of, a notion we call system bias (§7). Finally, we summarize our actionable messages for future summarization research (§8). We summarize some notable findings as follows:
Summarization of personal posts and news articles, except for XSum Narayan et al. (2018a), is biased toward the position aspect, while academic papers are well balanced among the three aspects (see Figure 3 (a)). Summarizing long documents (e.g., books and movie scripts) and conversations (e.g., meeting minutes) are extremely difficult tasks that require multiple aspects together.
Biases do exist in current summarization systems (Figure 3 (b)). Simple ensembling of systems covering multiple aspects shows performance comparable to simple single-aspect systems.
Reference summaries in current corpora include less than 15% new words that do not appear in the source document, except for the abstract text of academic papers.
2 Related Work
We provide here a brief review of prior work on summarization biases. Lin and Hovy (1997) studied the position hypothesis, which has been examined especially in news article writing Hong and Nenkova (2014); Narayan et al. (2018a) but not in other domains such as conversations Kedzie et al. (2018). Narayan et al. (2018a) collected a new corpus to address this bias by compressing multiple contents of the source document into a single target summary. In the bias analysis of systems, Lin and Bilmes (2012, 2011) studied the sub-aspect hypothesis of summarization systems. Our study extends the hypothesis to various corpora as well as systems. With a specific focus on the importance aspect, recent work Peyrard (2019a) divided it into three sub-categories (redundancy, relevance, and informativeness) and provided quantities to measure each. Compared to this, our work provides a broader-scale sub-aspect analysis across various corpora and systems.
We analyze the sub-aspects on different domains of summarization corpora: news articles Nallapati et al. (2016); Grusky et al. (2018); Narayan et al. (2018a), academic papers or journals Kang et al. (2018); Kedzie et al. (2018), movie scripts Gorinski and Lapata (2015), books Mihalcea and Ceylan (2007), personal posts Ouyang et al. (2017), and meeting minutes Carletta et al. (2005) as described further in §5.
Beyond the corpora themselves, a variety of summarization systems have been developed: Mihalcea and Tarau (2004); Erkan and Radev (2004) used graph-based keyword ranking algorithms. Lin and Bilmes (2010); Carbonell and Goldstein (1998) found summary sentences which are highly relevant but less redundant. Yogatama et al. (2015) used semantic volumes of bigram features for extractive summarization. Internal structures of documents have also been used in summarization: syntactic parse trees Woodsend and Lapata (2011); Cohn and Lapata (2008), topics Zajic et al. (2004); Lin and Hovy (2000), semantic word graphs Mehdad et al. (2014); Gerani et al. (2014); Ganesan et al. (2010); Filippova (2010); Boudin and Morin (2013), and abstract meaning representation Liu et al. (2015). A concept-based Integer Linear Programming (ILP) solver McDonald (2007) has been used to optimize the summarization problem Gillick and Favre (2009); Banerjee et al. (2015); Boudin et al. (2015); Berg-Kirkpatrick et al. (2011). Durrett et al. (2016) optimized the problem with grammaticality and anaphoricity constraints.
With large-scale corpora for training, neural network based systems have recently been developed. Among abstractive systems, Rush et al. (2015) proposed a local attention-based sequence-to-sequence model. On top of the seq2seq framework, many other variants have been studied: convolutional networks Cheng and Lapata (2016); Allamanis et al. (2016), pointer networks See et al. (2017), scheduled sampling Bengio et al. (2015), and reinforcement learning Paulus et al. (2017). In extractive systems, different types of encoders Cheng and Lapata (2016); Nallapati et al. (2017); Kedzie et al. (2018) and optimization techniques Narayan et al. (2018b) have been developed. Our goal is to explore which types of systems learn which sub-aspects of summarization.
3 Sub-aspects of Summarization
We focus on three crucial aspects: Position, Diversity, and Importance. For each aspect, we use different extractive algorithms to capture how much of the aspect is used in the oracle extractive summaries (see §4 for our oracle set construction). For each algorithm, the goal is to select extractive summary sentences (equal in number to the sentences of the target summary for each sample) out of the sentences appearing in the original source. The chosen sentences or their indices are then used to calculate the various evaluation metrics described in §4.
For some of the algorithms below, we use vector representations of sentences. We parse a document into a sequence of sentences (s_1, ..., s_n), where each sentence s_i consists of a sequence of words (w_1, ..., w_m). Each sentence is then encoded:

v_i = BERT(s_i)    (1)
where BERT Devlin et al. (2018) is a pre-trained bidirectional encoder from transformers Vaswani et al. (2017) (other encoders, such as averaged word embeddings Pennington et al. (2014), show comparable performance). We use the last layer from BERT as the representation of each token, and then average the token representations to obtain the final representation of a sentence. All tokens are lower-cased.
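A rough sketch of this encoding step; random vectors stand in for BERT's last-layer token states, since loading an actual pre-trained model is orthogonal to the averaging shown here:

```python
import numpy as np

def sentence_embedding(token_vectors):
    """Average the per-token vectors (e.g., BERT's last hidden layer)
    into a single fixed-size sentence representation."""
    return np.asarray(token_vectors).mean(axis=0)

# Stand-in for BERT's last layer: one 768-dim vector per lower-cased token.
rng = np.random.default_rng(0)
token_states = rng.normal(size=(5, 768))  # a 5-token sentence
v = sentence_embedding(token_states)
print(v.shape)  # (768,)
```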
Position of sentences in the source has been suggested as a good indicator for choosing summary sentences, especially in news articles Lin and Hovy (1997); Hong and Nenkova (2014); See et al. (2017). We compare three position-based algorithms: First, Last, and Middle, which simply choose the required number of sentences from these positions in the source document.
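A minimal sketch of these position baselines (the function and variable names are our own):

```python
def position_baseline(sentences, k, mode="first"):
    """Select k sentences purely by position: First, Last, or Middle."""
    n = len(sentences)
    k = min(k, n)
    if mode == "first":
        idx = range(0, k)
    elif mode == "last":
        idx = range(n - k, n)
    elif mode == "middle":
        start = (n - k) // 2
        idx = range(start, start + k)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return [sentences[i] for i in idx]

doc = ["s0", "s1", "s2", "s3", "s4", "s5"]
print(position_baseline(doc, 2, "first"))   # ['s0', 's1']
print(position_baseline(doc, 2, "last"))    # ['s4', 's5']
print(position_baseline(doc, 2, "middle"))  # ['s2', 's3']
```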
Yogatama et al. (2015) assume that the extractive summary sentences which maximize the semantic volume in a distributed semantic space are the most diverse but least redundant sentences. Motivated by this notion, our goal is to find a set of sentences that maximizes the size of the volume they span in a continuous embedding space, such as the BERT representations in Eq 1. Our objective is to find the subset S of source sentences that maximizes the volume of the selected set: S* = argmax_S Vol(S).
If the selected set includes every sentence from the source document, it trivially spans the full volume (Figure 7 (a)). However, for a smaller subset the spanned volume is not guaranteed to be maximal, because the enclosing region may be a non-convex polygon. In order to find a convex maximum volume, we consider the two algorithms described below.
Heuristic. Yogatama et al. (2015) heuristically choose a set of summary sentences using a greedy algorithm: it first chooses the sentence whose vector representation is farthest from the centroid of all source sentences, and then repeatedly finds the sentence whose representation is farthest from the centroid of the representations of the sentences chosen so far. Unlike the original algorithm in Yogatama et al. (2015), which restricts the number of words, we constrain the total number of selected sentences. This heuristic can fail to find the maximum volume, depending on its starting point and on which of two distant points it detects first (Figure 7 (b)).
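The greedy heuristic might be sketched as follows, with toy 2-D points standing in for the sentence vectors of Eq 1:

```python
import numpy as np

def greedy_diverse(vectors, k):
    """Greedy volume heuristic (after Yogatama et al., 2015), constrained by
    sentence count: start from the vector farthest from the centroid of all
    sentences, then repeatedly add the vector farthest from the centroid of
    the sentences chosen so far."""
    vectors = np.asarray(vectors, dtype=float)
    centroid = vectors.mean(axis=0)
    chosen = [int(np.argmax(np.linalg.norm(vectors - centroid, axis=1)))]
    while len(chosen) < k:
        c = vectors[chosen].mean(axis=0)
        dists = np.linalg.norm(vectors - c, axis=1)
        dists[chosen] = -np.inf  # never re-pick a chosen sentence
        chosen.append(int(np.argmax(dists)))
    return sorted(chosen)

# Toy 2-D stand-ins for sentence vectors.
pts = [[0, 0], [0.1, 0.1], [1, 0], [0, 1], [1, 1]]
print(greedy_diverse(pts, 2))  # [0, 4]: two mutually distant corners
```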
ConvexFall. Here we first find the convex hull (defined for a set of points as the smallest convex set that includes them) using Quickhull Barber et al. (1996), as implemented in the Qhull library (http://www.qhull.org/). It guarantees the maximum volume with the minimum number of selected points (Figure 7 (c)). However, it does not reduce redundancy between the points on the convex hull, and it usually chooses more sentences than needed. Marcu (1999) presents an interesting study regarding the importance of sentences: given a document, if one repeatedly deletes the least central sentence from the source text, then at some point the similarity with the reference text drops rapidly, which he calls the waterfall phenomenon. Motivated by his study, we similarly prune redundant sentences from the set chosen by the convex-hull search: on each turn, the sentence with the lowest volume reduction ratio is pruned, until the number of remaining sentences equals the target summary length.
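A sketch of ConvexFall using SciPy's Quickhull-based `ConvexHull` (in 2-D, `.volume` is the hull area); the toy points and tie-breaking behavior are illustrative assumptions, not the paper's implementation:

```python
import numpy as np
from scipy.spatial import ConvexHull

def convex_fall(vectors, k):
    """ConvexFall sketch: start from the convex-hull vertices (Quickhull), then
    repeatedly prune the vertex whose removal shrinks the hull volume least
    (lowest volume reduction ratio) until k sentences remain."""
    vectors = np.asarray(vectors, dtype=float)
    selected = [int(v) for v in ConvexHull(vectors).vertices]
    while len(selected) > k:
        full = ConvexHull(vectors[selected]).volume  # area in 2-D
        losses = []
        for i in range(len(selected)):
            rest = selected[:i] + selected[i + 1:]
            try:
                vol = ConvexHull(vectors[rest]).volume
            except Exception:  # degenerate hull (e.g., collinear points)
                vol = 0.0
            losses.append((full - vol) / full)
        selected.pop(int(np.argmin(losses)))
    return sorted(selected)

# Toy 2-D stand-ins: a square, an interior point (never on the hull),
# and one extra hull vertex that the pruning step may drop.
pts = [[0, 0], [2, 0], [2, 2], [0, 2], [1, 1], [3, 1]]
print(convex_fall(pts, 4))
```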
We assume that contents that occur repeatedly in a document carry important information. We find the sentences that are nearest to their neighbour sentences, using two distance measures. N-Nearest calculates the averaged Pearson correlation between each sentence's vector representation and all the others; the sentences with the highest averaged correlation are selected as the final extractive summary. K-Nearest, on the other hand, chooses the K nearest sentences for each sentence, and then averages the distances between each of these neighbours and the sentence itself. The sentence with the lowest averaged distance is chosen; this calculation is repeated, with selected sentences removed from the remaining pool, until the required number of sentences is selected.
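The N-Nearest scoring might be sketched as follows (toy vectors; `np.corrcoef` supplies the pairwise Pearson correlations):

```python
import numpy as np

def n_nearest(vectors, k):
    """N-Nearest sketch: score every sentence by its averaged Pearson
    correlation with all other sentence vectors; keep the k highest-scoring
    sentences as the extractive summary."""
    vectors = np.asarray(vectors, dtype=float)
    corr = np.corrcoef(vectors)   # pairwise Pearson correlations
    np.fill_diagonal(corr, 0.0)   # ignore self-correlation
    scores = corr.sum(axis=1) / (len(vectors) - 1)
    return sorted(np.argsort(-scores)[:k].tolist())

# Toy vectors: sentences 0, 1, 3 share a pattern; sentence 2 is an outlier.
vecs = [[1.0, 2.0, 3.0], [1.1, 2.0, 3.2], [3.0, 1.0, 2.0], [1.0, 2.1, 3.1]]
print(n_nearest(vecs, 2))  # the outlier (index 2) is never selected
```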
In order to determine the aspects most crucial to the summarization task, we use three evaluation metrics:
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) Lin and Hovy (2000) is a standard metric for evaluating summarization systems. We use ROUGE-1 (R1), ROUGE-2 (R2), and ROUGE-L (RL) F-measure scores, which correspond to unigrams, bigrams, and longest common subsequences, respectively, as well as their averaged score (R).
Volume Overlap (VO) ratio. Hard metrics like ROUGE often ignore semantic similarities between sentences. Based on the volume assumption in Yogatama et al. (2015), we measure the overlap ratio of the two semantic volumes spanned by the model and target summaries. For the i-th document, we obtain a set of vector representations of the reference summary sentences and of the model summary sentences predicted by any algorithm in §3.
Each volume V is then calculated using the convex-hull algorithm, and the overlap of the two volumes (∩) is calculated using the shapely package (https://pypi.org/project/Shapely/). Since the overlap between two polytopes of high dimension cannot be computed directly, we first reduce the representations to a 2D PCA space. The final VO is then:

VO = (1/N) * Σ_i Vol( V(f(R_i)) ∩ V(f(M_i)) )
where N is the total number of input documents, f is the BERT sentence encoder in Eq 1, and f(R_i) and f(M_i) are the sets of vector representations of the reference and model summary sentences for document i, respectively. The volume overlap indicates how much the two summaries overlap semantically in a continuous embedding space.
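Assuming an already-2-D (e.g., PCA-projected) space, the overlap computation might look like this sketch; normalizing by the reference hull's area is our assumption for illustration, not necessarily the exact formulation:

```python
from shapely.geometry import MultiPoint

def volume_overlap(ref_pts, model_pts):
    """VO sketch in 2-D: build the convex hull of each summary's sentence
    points and return the intersection area, normalized here by the
    reference hull's area (an illustrative assumption)."""
    ref_hull = MultiPoint(list(map(tuple, ref_pts))).convex_hull
    model_hull = MultiPoint(list(map(tuple, model_pts))).convex_hull
    if ref_hull.area == 0:
        return 0.0
    return ref_hull.intersection(model_hull).area / ref_hull.area

ref = [(0, 0), (2, 0), (2, 2), (0, 2)]    # reference hull: area 4
model = [(1, 0), (3, 0), (3, 2), (1, 2)]  # covers the right half of it
print(volume_overlap(ref, model))         # 0.5
```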
Sentence Overlap (SO) ratio. Even though ROUGE provides a recall-oriented lexical overlap, it does not tell us the upper bound on performance (called the oracle) for extractive summarization. We extract the oracle extractive sentences (i.e., a set of input sentences) that maximizes the ROUGE-L F-measure score with the reference summary. We then measure sentence overlap (SO), which determines how many of the extractive sentences chosen by our algorithms are in the oracle summary. The SO is:

SO = (1/N) * Σ_i C(S_model_i ∩ S_oracle_i) / C(S_oracle_i)

where C is a function that counts the number of elements in a set, and S_model_i and S_oracle_i are the sets of sentences chosen by the algorithm and by the oracle for document i. The sentence overlap indicates how well the algorithm finds the oracle summaries for extractive summarization.
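For a single document, the SO computation reduces to a set-overlap ratio, sketched here with hypothetical sentence indices:

```python
def sentence_overlap(model_idx, oracle_idx):
    """SO sketch for one document: the fraction of oracle sentence indices
    that the algorithm's selected indices recover (C is set cardinality)."""
    model, oracle = set(model_idx), set(oracle_idx)
    return len(model & oracle) / len(oracle)

# The algorithm picked sentences 0, 1, 5; the oracle set is 0, 5, 7.
print(sentence_overlap({0, 1, 5}, {0, 5, 7}))  # 2/3 of the oracle recovered
```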
Table 1: Characteristics of each dataset (train/test).

| | CNNDM | Newsroom | XSum | PeerRead | PubMed | Reddit | AMI | BookSum | MScript |
|---|---|---|---|---|---|---|---|---|---|
| Data size | 287K/11K | 992K/109K | 203K/11K | 10K/550 | 21K/2.5K | 404/48 | 98/20 | - /53 | - /1K |
| Avg src sents. | 40/34 | 24/24 | 33/33 | 45/45 | 97/97 | 19/15 | 767/761 | - /6.7K | - /3K |
| Avg tgt sents. | 4/4 | 1.4/1.4 | 1/1 | 6/6 | 10/10 | 1/1 | 17/17 | - /336 | - /5 |
| Avg src tokens | 792/779 | 769/762 | 440/442 | 1K/1K | 2.4K/2.3K | 296/236 | 6.1K/6.4K | - /117K | - /23.4K |
| Avg tgt tokens | 55/58 | 30/31 | 23/23 | 144/146 | 258/258 | 24/25 | 281/277 | - /6.6K | - /104 |
5 Summarization Corpora
We use various domains of summarization datasets to conduct the bias analysis across corpora and systems. Each dataset has source documents and corresponding abstractive target summaries. We provide a list of datasets used along with a brief description and our pre-processing scheme:
CNNDM Nallapati et al. (2016): contains around 300K online news articles. Each article has a multi-sentence summary (4.0 sentences on average).
Newsroom Grusky et al. (2018): contains 1.3M news articles and summaries written by authors and editors from 1998 to 2017. It has both extractive and abstractive summaries.
XSum Narayan et al. (2018a): has news articles and their single-sentence but abstractive summaries, mostly written by the original author.
PeerRead Kang et al. (2018): is a dataset of academic paper drafts and their peer reviews. For summarization, the Introduction section of each paper is used as the source document and its abstract as the target summary.
PubMed Kedzie et al. (2018): contains 25,000 medical journal papers from the PubMed Open Access Subset (https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/). Unlike PeerRead, the full paper except for the abstract is used as the source document.
Reddit Ouyang et al. (2017): is a collection of personal posts (narratives) from Reddit, paired with crowd-sourced abstractive summaries.
MScript Gorinski and Lapata (2015): is a collection of movie scripts from the ScriptBase corpus and their corresponding user-written summaries of the movies.
BookSum Mihalcea and Ceylan (2007): is a dataset of classic books paired with summaries from Grade Saver (http://www.gradesaver.com) and Cliff's Notes (http://www.cliffsnotes.com/). Due to the large number of sentences, we only use the first 1K sentences of each source document and the first 50 sentences of each target summary.
AMI Carletta et al. (2005): consists of documented meeting minutes from a hundred hours of recordings, paired with abstractive summaries.
Table 1 summarizes the characteristics of each dataset. We note that Gigaword Graff et al. (2003), New York Times (https://catalog.ldc.upenn.edu/LDC2008T19), and the Document Understanding Conference (DUC) datasets (http://duc.nist.gov) are also popular and commonly used in summarization analyses; here we exclude them because they represent only additional collections of news articles and show tendencies similar to the other news datasets such as CNNDM.
6 Analysis on Corpus Bias
We conduct different analyses of how each corpus is biased with respect to the sub-aspects. We highlight some key findings for each sub-section.
6.1 Multi-aspect analysis
Table 2 shows a comparison of the three aspects for each corpus, together with random selection and the oracle set. For each dataset, metrics are calculated on the test set, except for BookSum and AMI, where we use train+test due to the smaller sample size.
Earlier isn't always better. Sentences selected early in the source show high ROUGE and SO on CNNDM, Newsroom, Reddit, and BookSum, but not in other domains such as medical journals, meeting minutes, and the condensed news summaries of XSum. For summarization of movie scripts in particular, the last sentences seem to provide more important summary content.
XSum relies on importance more than other corpora. Interestingly, the most powerful algorithm for XSum is N-Nearest. This shows that summaries in XSum are indeed collected by abstracting multiple important contents into a single sentence, avoiding the position bias.
First, ConvexFall, and N-Nearest tend to work better than the other algorithms for their respective aspects. First is better than Last or Middle on news articles (except for XSum) and personal posts, but not on academic papers (i.e., PeerRead, PubMed) or meeting minutes. ConvexFall finds a set of sentences that maximizes the semantic volume overlap with the target sentences better than the heuristic does.
ROUGE and SO show similar behavior, while VO does not.
In most evaluations, ROUGE scores track SO ratios linearly, as expected. However, VO has high variance across algorithms and aspects. This is mainly because the semantic volume assumption maximizes semantic diversity but sacrifices other aspects like importance by choosing the outlier sentences on the convex hull.
Social posts and news articles are biased toward the position aspect, while the other two aspects appear less relevant (Figure 3 (a)). XSum, however, requires all aspects equally, though each aspect is relatively less relevant than in the other news corpora.
Paper summarization is a well-balanced task. The variance of SO across the three aspects in PeerRead and PubMed is relatively smaller than in other corpora. This indicates that the abstract summary of a paper requires all three aspects at the same time. PeerRead has relatively higher SO than PubMed because it only summarizes the text of the Introduction section, while PubMed summarizes the whole paper text, which is much more difficult (almost random performance).
Conversation, movie script, and book summarization are very challenging. Spoken meeting minutes include many repeated short replies (e.g., 'okay.', 'mm-hmm.', 'yeah.'), causing importance and diversity measures to suffer. MScript and BookSum, which include very long input documents, appear to be extremely difficult tasks, showing almost random performance.
6.2 Intersection between the sub-aspects
Averaged ratios across the sub-aspects do not capture how the actual summaries overlap with each other. Figure 14 shows Venn diagrams of how the sets of summary sentences chosen by the different sub-aspects overlap with each other on average.
XSum, BookSum, and AMI have high Oracle Recall. If we develop a mixture model of the three aspects, the Oracle Recall gives its upper bound, meaning that additional sub-aspects would need to be considered regardless of the mixture model. This indicates that the existing procedures are not enough to cover the oracle sentences. For example, AMI and BookSum have many repeated noisy sentences, some of which could likely be removed without a significant loss of pertinent information.
Importance and Diversity overlap little with each other. This means that important sentences are not always diverse sentences, indicating that the two aspects should be considered together.
6.3 Summaries in an embedding space
Figure 15 shows two-dimensional PCA projections of a document in CNNDM in the embedding space.
Source sentences cluster on the convex-hull border, not in the middle.
We conjecture that sentences are not uniformly distributed in the embedding space but gradually move along the convex hull. Target summaries reflect different sub-aspects depending on the sample and the corpus. For example, many target sentences in CNNDM are near the First-k sentences.
6.4 Single-aspect analysis
We calculate the frequency of source sentences overlapping with the oracle summary, where the source sentences are ranked differently according to the algorithm of each aspect (see Figure 19). Heavily skewed histograms indicate that oracle sentences are positively (right-skewed) or negatively (left-skewed) related to the sub-aspect.
In most cases, some oracle sentences overlap with the first part of the source. Even though the degrees differ, oracle summaries from many corpora (i.e., CNNDM, Newsroom, PeerRead, BookSum, MScript) are highly related to position. Compared to the other corpora, PubMed and AMI contain more top-ranked important sentences in their oracle summaries. News articles and papers tend to have oracle sentences with little diversity (i.e., right-skewed), meaning that non-diverse sentences are frequently selected as part of the oracle.
We also measure how many new words occur in the abstractive target summaries by comparing the overlap between oracle summaries, target summaries, and document sentences (Table 3). One thing to note is that XSum and AMI have fewer new words in their target summaries. On the other hand, the paper datasets (i.e., PeerRead and PubMed) include many, indicating that the abstract text of an academic paper is indeed "abstract".
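The "new word" measure can be sketched as an N-gram novelty ratio; the whitespace tokenization and the normalization by the target N-gram count are assumptions for illustration:

```python
def ngrams(tokens, n):
    """Set of N-grams of a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def novel_ngram_ratio(target_tokens, source_tokens, n=1):
    """Proportion of target-summary N-grams that never occur in the source
    document; higher means more 'new words', i.e., a more abstractive summary."""
    tgt, src = ngrams(target_tokens, n), ngrams(source_tokens, n)
    return len(tgt - src) / len(tgt) if tgt else 0.0

src = "the model achieves strong results on news data".split()
tgt = "the model performs well on news data".split()
print(round(novel_ngram_ratio(tgt, src, n=1), 3))  # 2 of 7 unigrams are new
```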
Table 3: ROUGE of oracle summaries and averaged N-gram overlap ratios. O, T, and S are the sets of N-grams from the Oracle, Target, and Source document, respectively. R(O,T) is the averaged ROUGE between oracle and target summaries, showing how similar they are. O∩T shows the N-gram overlap between oracle and target summaries; the higher, the more overlapping words between them. T∩S is the proportion of N-grams in the target summaries that also occur in the source document; the lower it is, the more abstractive (i.e., the more new words) the target summaries are.
7 Analysis on System Bias
We study how current summarization systems are biased with respect to three sub-aspects. In addition, we show that a simple ensemble of systems shows comparable performance to the single-aspect systems.
We compare various extractive and abstractive systems. For extractive systems, we use K-Means Lin and Bilmes (2010), Maximal Marginal Relevance (MMR) Carbonell and Goldstein (1998), cILP Gillick and Favre (2009); Boudin et al. (2015), TextRank Mihalcea and Tarau (2004), LexRank Erkan and Radev (2004), and three recent neural systems: CL Cheng and Lapata (2016), SumRun Nallapati et al. (2017), and S2SExt Kedzie et al. (2018). For abstractive systems, we use WordILP Banerjee et al. (2015) and four neural systems: S2SAbs Rush et al. (2015), Pointer See et al. (2017), Teacher Bengio et al. (2015), and RL Paulus et al. (2017). The detailed description and experimental setup for each algorithm are in the Appendix.
Proposed ensemble systems.
Motivated by the sub-aspect theory Lin and Bilmes (2012, 2011), we combine different types of systems from two pools of extractive systems: asp, the three best algorithms, one from each aspect; and ext, all extractive systems. For each combination, we choose the summary sentences either randomly from the union of the predicted sentences (rand) or as the most frequent unique sentences (topk).
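The two combination rules might be sketched as follows; the member systems' predictions below are hypothetical sentence-index lists:

```python
from collections import Counter
import random

def ensemble_topk(predictions, k):
    """'topk' ensemble sketch: keep the k sentence indices proposed by the
    most member systems."""
    counts = Counter(i for pred in predictions for i in set(pred))
    return sorted(i for i, _ in counts.most_common(k))

def ensemble_rand(predictions, k, seed=0):
    """'rand' ensemble sketch: sample k indices from the union of all
    member predictions."""
    pool = sorted({i for pred in predictions for i in pred})
    return sorted(random.Random(seed).sample(pool, min(k, len(pool))))

# Three hypothetical extractive systems and their selected sentence indices.
preds = [[0, 3, 7], [0, 3, 9], [0, 5, 7]]
print(ensemble_topk(preds, 2))  # index 0 is proposed by all three systems
```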
Table 4 shows a comparison of existing and proposed summarization systems on the set of corpora in §5, except for Newsroom, which we exclude because it behaves similarly to CNNDM. Neural extractive systems such as CL, SumRun, and S2SExt outperform the others in general. LexRank is highly biased toward the position aspect. On the other hand, MMR is extremely biased toward the importance aspect on XSum and Reddit. Interestingly, neural extractive systems are somewhat balanced compared to the others. Ensemble systems seem to balance the three sub-aspects better than the neural extractive systems. They also outperform the others (in either ROUGE or SO) on five out of eight datasets.
8 Conclusion and Future Directions
We define three sub-aspects of text summarization: position, diversity, and importance. We analyze how different domains of summarization datasets are biased toward these aspects. We observe that news articles strongly reflect the position aspect, while the others do not. In addition, we investigate how well current summarization systems balance these three sub-aspects. Each type of approach has its own bias, while neural systems rarely do. Simple ensembling of the systems shows more balanced, comparable performance relative to single systems.
We summarize actionable messages for future summarization research:
Different domains of datasets beyond news articles pose new challenges for the design of summarization systems. For example, summarization of conversations (e.g., AMI) or dialogues (e.g., MScript) needs to filter out repeated, rhetorical utterances. Book summarization (e.g., BookSum) is very challenging due to its extremely large document size, where current neural encoders hit computational limits.
Summarization systems to be developed should clearly state their computational limits as well as their effectiveness for each aspect and each corpus domain. A good summarization system should reflect the different sub-aspects harmoniously, regardless of corpus bias. Developing such bias-free or robust models is an important future direction.
Nobody has yet clearly defined the deeper nature of meaning abstraction. A more theoretical study of summarization and its various aspects is required. A recent notable example is Peyrard (2019a)'s attempt to theoretically define different quantities of the importance aspect and to demonstrate the potential of the framework on an existing summarization system. Similar studies can be applied to other aspects and their combinations, across various systems and different domains of corpora.
One could also repeat our bias study on evaluation metrics. Peyrard (2019b) showed that widely used evaluation metrics (e.g., ROUGE, Jensen-Shannon divergence) strongly disagree when scoring summary results. One could compare different measures (e.g., n-gram recall, sentence overlap, embedding similarities, word connectedness, centrality, importance reflected by discourse structures) and study the bias of each with respect to systems and corpora.
This work would not have been possible without the efforts of the authors who kindly share the summarization datasets publicly. We thank Rada Mihalcea for sharing the book summarization dataset. We also thank Diane J. Litman, Taylor Berg-Kirkpatrick, Hiroaki Hayashi, and anonymous reviewers for their helpful comments.
References

- A convolutional attention network for extreme summarization of source code. arXiv preprint arXiv:1602.03001.
- Multi-document abstractive summarization using ILP based multi-sentence compression. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 2015).
- The quickhull algorithm for convex hulls. ACM Transactions on Mathematical Software (TOMS) 22 (4), pp. 469–483.
- Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pp. 1171–1179.
- Jointly learning to extract and compress. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pp. 481–490.
- Keyphrase extraction for n-best reranking in multi-sentence compression. In North American Chapter of the Association for Computational Linguistics (NAACL).
- Concept-based summarization using integer linear programming: from concept pruning to multiple optimal solutions. In Conference on Empirical Methods in Natural Language Processing (EMNLP) 2015.
- The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 335–336.
- The AMI meeting corpus: a pre-announcement. In International Workshop on Machine Learning for Multimodal Interaction, pp. 28–39.
- Neural summarization by extracting sentences and words. arXiv preprint arXiv:1603.07252.
- Sentence compression beyond word deletion. In Proceedings of the 22nd International Conference on Computational Linguistics-Volume 1, pp. 137–144.
- BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Learning-based single-document summarization with compression and anaphoricity constraints. arXiv preprint arXiv:1603.08887.
- New methods in automatic extracting. Journal of the ACM (JACM) 16 (2), pp. 264–285.
- LexRank: graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research, pp. 457–479.
- Multi-sentence compression: finding shortest paths in word graphs. In Proceedings of the 23rd International Conference on Computational Linguistics, pp. 322–330.
- Opinosis: a graph-based approach to abstractive summarization of highly redundant opinions. In Proceedings of the 23rd International Conference on Computational Linguistics, pp. 340–348.
- Bottom-up abstractive summarization. arXiv preprint arXiv:1808.10792.
- Abstractive summarization of product reviews using discourse structure. In Proceedings of EMNLP.
- A scalable global model for summarization. In Proceedings of the Workshop on Integer Linear Programming for Natural Language Processing, pp. 10–18.
- Movie script summarization as graph-based scene extraction. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1066–1076.
- English Gigaword. Linguistic Data Consortium, Philadelphia 4 (1), pp. 34.
- NEWSROOM: a dataset of 1.3 million summaries with diverse extractive strategies. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, New Orleans, Louisiana, pp. 708–719.
- Improving the estimation of word importance for news multi-document summarization. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pp. 712–721.
- A dataset of peer reviews (PeerRead): collection, insights and NLP applications. In Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL), New Orleans, USA.
- Content selection in deep learning models of summarization. arXiv preprint arXiv:1810.12343.
- Identifying topics by position. In Fifth Conference on Applied Natural Language Processing.
- The automated acquisition of topic signatures for text summarization. In Proceedings of the 18th Conference on Computational Linguistics-Volume 1, pp. 495–501.
- ROUGE: a package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, Vol. 8.
- Learning mixtures of submodular shells with application to document summarization. arXiv preprint arXiv:1210.4871.
- Multi-document summarization via budgeted maximization of submodular functions. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 912–920.
- A class of submodular functions for document summarization. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pp. 510–520.
- Toward abstractive summarization using semantic representations. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1077–1086.
- Discourse trees are good indicators of importance in text. Advances in Automatic Text Summarization 293, pp. 123–136.
- A study of global inference algorithms in multi-document summarization. Springer.
- Abstractive summarization of spoken and written conversations based on phrasal queries. In Proceedings of ACL, pp. 1220–1230.
- Explorations in automatic book summarization. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL).
- TextRank: bringing order into text. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pp. 404–411.
- SummaRuNNer: a recurrent neural network based sequence model for extractive summarization of documents. In Thirty-First AAAI Conference on Artificial Intelligence.
- Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pp. 280–290.
- Don't give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. arXiv preprint arXiv:1808.08745.
- Ranking sentences for extractive summarization with reinforcement learning. arXiv preprint arXiv:1802.08636.
- Crowd-sourced iterative annotation for narrative summarization corpora. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pp. 46–51.
- A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304.
- Glove: global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543. Cited by: Appendix A, footnote 2.
- A simple theoretical model of importance for summarization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 1059–1073. External Links: Cited by: §2, 3rd item.
- Studying summarization evaluation metrics in the appropriate scoring range. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 5093–5100. External Links: Cited by: 4th item.
A neural attention model for abstractive sentence summarization. arXiv preprint arXiv:1509.00685. Cited by: Appendix A, §2, §7.
- Get to the point: summarization with pointer-generator networks. arXiv preprint arXiv:1704.04368. Cited by: Appendix A, §1, §2, §3.1, §7.
- Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §3.
- Learning to simplify sentences with quasi-synchronous grammar and integer programming. In Proceedings of the conference on empirical methods in natural language processing, pp. 409–420. Cited by: §2.
- Extractive summarization by maximizing semantic volume.. In EMNLP, pp. 1961–1966. Cited by: 4th item, §2, §3.2, §3.2, §4.
- Bbn/umd at duc-2004: topiary. In Proceedings of the HLT-NAACL 2004 Document Understanding Workshop, Boston, pp. 112–119. Cited by: §2.
Appendix A Systems and Setup: Details
For extractive systems, K-Means ranks sentence clusters in descending order of cluster size and then uses a greedy algorithm Lin and Bilmes (2010) to select the sentences nearest to each centroid. Maximal Marginal Relevance (MMR) finds sentences that are highly relevant to the document but minimally redundant with sentences already selected for the summary. cILP Gillick and Favre (2009); Boudin et al. (2015) weights sub-sentences and maximizes their coverage while globally minimizing redundancy using an Integer Linear Program (ILP). TextRank Mihalcea and Tarau (2004) and LexRank Erkan and Radev (2004) rank sentences by their centrality in a graph built from inter-sentence similarity. In addition, we use three recent neural extractive systems: CL Cheng and Lapata (2016), SumRun Nallapati et al. (2017), and S2SExt Kedzie et al. (2018), each of which has a slight variation in its extraction architecture (see Kedzie et al. (2018) for a detailed comparison).
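The MMR criterion above can be sketched as a greedy loop. This is a minimal illustration with bag-of-words cosine similarity, not the implementation used in the experiments; the trade-off weight `lam` is an assumption.

```python
import math
from collections import Counter

def cosine(a, b):
    # cosine similarity between two bag-of-words Counters
    num = sum(a[w] * b[w] for w in a)
    den = math.sqrt(sum(v * v for v in a.values())) * \
          math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def mmr_select(sentences, k, lam=0.5):
    # Greedy MMR: trade off relevance to the whole document against
    # redundancy with sentences already selected for the summary.
    bows = [Counter(s.lower().split()) for s in sentences]
    doc = Counter()
    for b in bows:
        doc.update(b)
    selected, candidates = [], list(range(len(sentences)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(bows[i], doc)
            redundancy = max((cosine(bows[i], bows[j]) for j in selected),
                             default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return [sentences[i] for i in selected]
```

With `lam=0.5`, a duplicated sentence is penalized enough that the second pick favors novel content over a verbatim repeat.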
In training CL, SumRun, and S2SExt, we upweight positive labels to make them proportional to the negative labels. We use 200-dimensional pre-trained GloVe Pennington et al. (2014) embeddings with 0.25 dropout, keeping them frozen during training. We use a CNN encoder with six window sizes and [25, 25, 50, 50, 50, 50] feature maps. We use a one-layer sequence-to-sequence model with an LSTM of hidden size 300 and an MLP of size 100 with 0.25 dropout. SumRun uses segment embeddings of size 16 and position embeddings of size 16.
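The positive-label upweighting can be computed per document as the negative-to-positive ratio; since the setup only states that positives are made proportional to negatives, the exact helper below is an assumption.

```python
def positive_label_weight(labels):
    # Weight applied to each positive extraction label so that the total
    # weight of positives matches the count of negatives (hypothetical
    # helper; only the proportionality is stated in the setup).
    pos = sum(labels)
    neg = len(labels) - pos
    return neg / pos if pos else 1.0
```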
For abstractive systems, we use WordILP Banerjee et al. (2015), which produces a word graph of important sentences and then chooses sentences from the word graph using an ILP solver. We also use incremental sequence-to-sequence models: a basic S2SAbs Rush et al. (2015); Pointer, which adds a pointer network See et al. (2017); Teacher, which adds teacher forcing Bengio et al. (2015); and RL, which adds reinforcement learning on the evaluation metric Paulus et al. (2017).
In training S2SAbs, Pointer, Teacher, and RL, we use a GRU of hidden size 150 with 300-dimensional GloVe embeddings. Pointer uses a maximum coverage function with NLL loss. Teacher uses a 0.75 teacher-forcing ratio with an exponential decay function, and RL uses a 0.1 ratio of RL optimization after the first epoch of S2SAbs training. We use beam search of size 4 at decoding, a batch size of 32, and the Adam optimizer with a learning rate of 0.001.
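The Teacher and RL settings can be sketched as follows. The 0.75 initial teacher-forcing ratio and the 0.1 RL mixing weight come from the setup above; the decay constant and the exact exponential schedule, as well as the mixed-objective form (in the style of Paulus et al. (2017)), are assumptions.

```python
import math

def teacher_forcing_ratio(step, initial=0.75, decay=1e-4):
    # Exponentially decaying probability of feeding the gold token
    # at each decoder step (schedule form assumed).
    return initial * math.exp(-decay * step)

def mixed_loss(ml_loss, rl_loss, gamma=0.1):
    # Mixed maximum-likelihood / reinforcement-learning objective,
    # with gamma = 0.1 as the RL optimization ratio in our setup.
    return (1 - gamma) * ml_loss + gamma * rl_loss
```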
For MScript, the original dataset has no data split, so we randomly split it into train, valid, and test sets with ratios of 0.9, 0.05, and 0.05, respectively.
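The MScript split amounts to a shuffled 90/5/5 slice; a minimal sketch, where the random seed is an assumption for reproducibility:

```python
import random

def split_dataset(items, seed=0):
    # Randomly split into 90/5/5 train/valid/test, matching the ratios
    # used for MScript (the seed is an assumption).
    items = list(items)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train, n_valid = int(0.9 * n), int(0.05 * n)
    return (items[:n_train],
            items[n_train:n_train + n_valid],
            items[n_train + n_valid:])
```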
Appendix B Venn Diagram for All Datasets
Sentence Venn diagrams among the three aspects and the oracle for all datasets are shown in Figure 29. Newsroom shows a pattern analogous to XSum. Compared to PeerRead, PubMed has relatively less sentence overlap between First-k and the other two aspects. MScript has extremely small oracle sentence overlaps with all three aspects. However, this is mainly due to the characteristics of the dataset: it has long source documents (1k sentences on average) with short summaries (5 sentences on average).
Appendix C Full ROUGE F Scores for Corpus Bias Analysis
Appendix D Documents in an Embedding Space for All Datasets
In Figures 34–39, we show more two-dimensional PCA projection examples of source documents from all datasets. We find a weak pattern in where target sentences lie according to their number. For example, in XSum and Reddit, which have a single target sentence, we observe that some target sentences are located in the middle of the ConvexHull, far from any source sentences.
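The two-dimensional projections are standard PCA over sentence representations; a minimal NumPy sketch (the source of the sentence embeddings is assumed):

```python
import numpy as np

def pca_2d(X):
    # Project sentence embeddings (n x d) onto their first two
    # principal components via SVD of the centered matrix.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T
```

Because SVD orders singular values, the first projected axis carries at least as much variance as the second.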
Appendix E System Biases per Corpus with the Three Sub-aspects
In Figure 48, we show more diagrams of system biases toward each of the three sub-aspects. We find that biases vary by corpus: for example, in Reddit, many systems share an importance bias, whereas in AMI systems are biased toward the diversity aspect. Also, some systems tend to be biased toward a certain aspect across different corpora: for systems such as KMeans and MMR, many corpora show a bias toward the importance aspect.