Extending Multi-Text Sentence Fusion Resources via Pyramid Annotations

by   Daniela Brook Weiss, et al.
Bar-Ilan University

NLP models that compare or consolidate information across multiple documents often struggle when challenged with recognizing substantial information redundancies across the texts. For example, in multi-document summarization it is crucial to identify salient information across texts and then generate a non-redundant summary, while facing repeated and usually differently-phrased salient content. To facilitate researching such challenges, the sentence-level task of sentence fusion was proposed, yet previous datasets for this task were very limited in their size and scope. In this paper, we revisit and substantially extend previous dataset creation efforts. With careful modifications, relabeling and employing complementing data sources, we were able to triple the size of a notable earlier dataset. Moreover, we show that our extended version uses more representative texts for multi-document tasks and provides a larger and more diverse training set, which substantially improves model training.




1 Introduction

Despite recent advances in summarizing single documents, multi-document summarization (MDS) has not progressed at the same pace. The task still poses the same challenges as its single-document counterpart, namely addressing saliency and coherency, but it also requires effective measures for identifying and consolidating redundant information. In light of this, several works proposed a tighter-scoped task, called sentence fusion, which focuses on summarizing multiple sentences with overlapping content into a non-redundant one. Such a sentence-level task allows a fine-grained analysis of which information units are shared among the input sentences, as well as control over different degrees of information inclusion and exclusion. Sentence fusion may thus stand as a task on its own or serve as a component task within MDS or other text generation settings.

However, the available resources for fusing sentences which exhibit significant content overlap are still lacking: the most recent datasets contain only several hundred examples McKeown et al. (2010); Thadani and McKeown (2013), which impedes further research. In this work, we follow Thadani and McKeown (2013) and extend their sentence fusion dataset, which is derived from Pyramid annotations created for MDS evaluation Nenkova and Passonneau (2004). Table 1 illustrates a single fusion instance taken from the Thadani and McKeown (2013) dataset, where sentences (a-d), whose contents overlap, are taken from different reference summaries for the same source texts. The overlapping content parts were annotated by experts into SCU (Summary Content Unit) spans (in bold) and then summarized into a short sentence, denoted as the gold fusion label.

a. Fisheries in parts of the Philippines have been decimated by the use of cyanide in fishing.
b. Philippine fishermen use cyanide in fishing, needlessly destroying immature fish.
c. Sodium cyanide use by fisherman decimates fish.
d. In the Philippines some fishermen use homemade explosives and cyanide for driving fish away from reefs and into nets.
Label Sodium cyanide use by fisherman decimates fish
Table 1: Sentence fusion example. (a-d) are the input sentences, originating from different expert summaries. Spans that are considered as contributing to the same single unit of content (SCU) are in bold. Label represents the gold SCU label, representing the fusion output.

We find that the heuristics and filters applied on their original dataset result in short and highly related sentences, which may not reflect more complex and long sentences that are often found in multi-text consolidation tasks. Moreover, their dataset uses exclusively sentences from expert summaries, which tend to be short and concise, and exclude the actual source documents that are used in practice for summarization. The resulting high similarity within input sentences makes them amenable to extractive methods, where a representative sentence can be selected as the summary, curbing the efforts to develop an abstractive fusion of sentences. In this paper, we remove most of their filters, relabel a portion of the instances, and supplement the data with sentences taken from source documents, while providing more challenging examples that better reflect realistic multi-source summarization tasks.

Our contribution is therefore an extended sentence-fusion dataset (our code and data can be found at https://github.com/DanielaBWeiss/Extending-Sentence-Fusion-Resources), more than 3 times larger than the original, with examples extracted from both summary and document sources, along with a manual relabeling of 18% of our target labels to better reflect the information overlap. We show that our final extended dataset is more representative of related and overlapping sentences from multiple sources in the wild. In addition, we provide fusion baseline models trained on both the original and our extended datasets. We show that a model trained on our extended dataset achieves higher ROUGE scores (Lin, 2004) on the original test set than the same model trained on the original data, while also seeming to generate better outputs that reflect the true content intersection of the inputs. Given that sentence fusion was originally introduced as a step in a multi-document summarization pipeline Barzilay and McKeown (2005); Marsi and Krahmer (2005); McKeown et al. (2010); Thadani and McKeown (2013), we believe that progress on this focused task may lead to new insights and breakthroughs in the larger-scoped multi-document summarization setting, as well as in additional text consolidation tasks.

2 Related Work

Sentence fusion is a sentence intersection generation task introduced in the multi-document setting. The task takes as input multiple sentences that share overlapping content, and produces a single sentence focused on that overlap Barzilay and McKeown (2005); Filippova and Strube (2008); Marsi and Krahmer (2005); McKeown et al. (2010); Thadani and McKeown (2013). Several other variants of sentence fusion have also been explored, such as sentence union (fusing the union of information in the input) Marsi and Krahmer (2005), and even strict sentence intersection Thadani and McKeown (2011) (producing strictly the intersection and nothing more). For multi-text summarization, however, a “looser” sense of sentence intersection is desired, since redundant content is most likely salient, yet additional non-overlapping information may still be relevant for a final summarized sentence. For this reason, our extended dataset follows the “looser” sense of sentence intersection used and described in Barzilay and McKeown (2005), McKeown et al. (2010) and Thadani and McKeown (2013).

It should be noted that, in recent years, the notion of sentence fusion has also been used to denote a quite different task variant, called “disparate” sentence fusion Elsner and Santhanam (2011); Geva et al. (2019); Lebanoff et al. (2019, 2020). Under this variant, the fused sentences do not exhibit considerable content overlap but are rather related in discourse. Accordingly, this task variant is concerned with “gluing” the input sentences into a longer sentence, while generating appropriate discourse structure, possibly inserting certain discourse connectives. This task is fundamentally different from the task we address, of fusing significantly redundant information, which is particularly relevant in the context of consolidating information from multiple documents (while discourse-oriented fusion is more often considered in the context of single-document summarization).

Filter Filtered SCU Labels
1 SL Saudi Arabia urged withdrawal
2 NV Resignation of Prime Minister Karami and his government
3 NV | SL Murder in Boulder, Colorado
4 NV Confirmed bird flu cases in Hong Kong
Filtered SCU Instance
5 Statoil admitted responsibility for the leak
A Statoil’s internal investigation acknowledged inadequate planning and a lack of risk appreciation led to the leak.
B Statoil admitted the leak resulted from inadequate planning and appreciation of risk, and failure to observe governing documentation.
Table 2: Originally filtered SCU instances in PyrFus (Thadani and McKeown, 2013). NV – No Verb, SL – Short Label. Examples 1-4 are SCU labels that were automatically filtered based on the label alone, while the last example was filtered due to the SCU contributing spans in A and B being much longer than the label itself, expressed as a short abstract paraphrase.

3 Data Collection

The original established dataset (Thadani and McKeown, 2013) for sentence fusion leverages annotations made during post-hoc evaluation of multi-document summarization systems. It is imperative to inspect the origin of the data and how it was re-purposed for this task (§3.1), as well as review the different processing steps implemented in previous works (§3.2), in order to assess our revisions and supplements (§3.3).

3.1 From Summary Evaluation to Sentence Fusion

The pyramid method, proposed by Nenkova and Passonneau (2004), is a well-known evaluation method for content selection in summarization, which was used extensively in the DUC (https://www-nlpir.nist.gov/projects/duc/data.html; years used: 2005-2008) and TAC (https://tac.nist.gov/; years used: 2009-2011) benchmarks for MDS. Applying this method, informative content in different documents is dissected into small information units, and equivalent units are aggregated across documents. The resulting clusters of largely-equivalent information units, each centered around a single statement, were utilized in subsequent work to form the basis for sentence fusion examples Thadani and McKeown (2013).

In DUC and TAC, a set of related documents over the same topic was summarized by at least four expert annotators, producing four or more short reference summaries per topic. Then, the reference summaries were further analyzed and divided into a set of informational units named Summary Content Units (SCUs). Each SCU denotes a single short statement, e.g. cyanide use by fisherman decimates fish (see bold spans in Table 1), and may be expressed in multiple summaries (and source documents) under different manifestations.

To mark an SCU, the annotator marks spans of text that directly contribute to it (SCU contributors), and labels the SCU by writing a concise statement in natural language, called the SCU label. The same label is applied to equivalent content units across documents, grouping together span contributions from different sources. Table 1 presents an example of an SCU cluster, containing four sentences with contributing spans (in bold), along with their associated SCU label. When evaluating an MDS system, a generated summary is scored proportionally to the number of SCUs whose information is covered by the summary, where each SCU is weighted by its frequency in the gold reference summaries. To create a sample for sentence fusion, the sentences in each SCU cluster are taken as input, and the SCU label, which concisely summarizes the contributing spans in the cluster, is considered as the gold text for the targeted fusion output.
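The mapping from an SCU cluster to a fusion instance, and the frequency-based SCU weight, can be sketched as follows. This is a minimal illustration and not the released code: the class names, the `summary_id` values and the unnormalized `raw_pyramid_score` helper are assumptions made for the example, and the cluster is the one shown in Table 1.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Contributor:
    summary_id: str  # hypothetical ID of the reference summary
    sentence: str    # full sentence that contains the contributing span


@dataclass
class SCUCluster:
    label: str
    contributors: List[Contributor]

    @property
    def weight(self) -> int:
        # SCU frequency: number of distinct reference summaries expressing it.
        return len({c.summary_id for c in self.contributors})


def to_fusion_instance(cluster: SCUCluster) -> Tuple[List[str], str]:
    """Cluster sentences become the input; the SCU label is the gold target."""
    return [c.sentence for c in cluster.contributors], cluster.label


def raw_pyramid_score(covered: List[SCUCluster]) -> int:
    """Unnormalized score: sum of the weights of the SCUs a summary covers."""
    return sum(c.weight for c in covered)


cluster = SCUCluster(
    label="Sodium cyanide use by fisherman decimates fish",
    contributors=[
        Contributor("ref_a", "Fisheries in parts of the Philippines have been "
                             "decimated by the use of cyanide in fishing."),
        Contributor("ref_b", "Philippine fishermen use cyanide in fishing, "
                             "needlessly destroying immature fish."),
        Contributor("ref_c", "Sodium cyanide use by fisherman decimates fish."),
        Contributor("ref_d", "In the Philippines some fishermen use homemade "
                             "explosives and cyanide for driving fish away "
                             "from reefs and into nets."),
    ],
)

inputs, target = to_fusion_instance(cluster)
print(len(inputs), "input sentences ->", target)
print("SCU weight:", cluster.weight)
```

The sketch leaves out the normalization of the pyramid score, since only the SCU weighting matters for building fusion instances.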

3.2 Pre-processing for Generating Fusion Examples

Thadani and McKeown (2013) applied several pre-processing steps to generate a fusion dataset (termed here PyrFus) from SCUs. While the original intention was to reduce noisy samples, these steps also removed a significant portion of challenging fusion instances. The applied steps include discarding all clusters that: (1) have more than 4 contributing sentences; (2) have SCU labels that don’t contain a verb after the first token; (3) have SCU labels or source sentences with fewer than 5 words or more than 100; (4) have contributing spans that are shorter than half of their source sentence; (5) have SCU labels that are shorter than half of the shortest contributing span in the input; (6) have SCU labels containing tokens that do not appear in any of the source sentences.

Filters (2) and (3) were used because not all SCU labels in the original data were grammatical sentences, while (4) and (5) were applied to remove inputs with too little mutual content, and often the SCU label did not cover the whole information intersection of the input sentences. (Since SCUs mostly cover single propositions, the same cluster of summary sentences will sometimes share multiple SCU labels, but these are split into individual fusion instances given the annotation protocol of the pyramid data.) Filter (6) was applied so that all gold fusion labels would be lexically reachable without the need for paraphrasing. The resulting dataset was the largest available source for supervised sentence fusion focused on multi-text, until this work, with a total of 1705 fusion instances (the number originally reported is slightly higher; this is the number we were able to reproduce using the authors’ published code), compared to the previously available fusion data of 300 instances McKeown et al. (2010).
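The six filters can be sketched as simple predicates. This is one reading of the description above, not the original PyrFus code: the function names, the regex tokenizer and the toy verb lexicon are assumptions (a part-of-speech tagger would normally supply `is_verb`), and filter (6) is read as "some label token appears in none of the source sentences".

```python
import re
from typing import Callable, List, Set


def words(text: str) -> List[str]:
    """Lowercased word tokens; a crude stand-in for a real tokenizer."""
    return re.findall(r"\w+", text.lower())


def label_filters(label: str, is_verb: Callable[[str], bool]) -> Set[int]:
    """Filters (2) and (3) as they apply to the SCU label alone."""
    lab = words(label)
    hit: Set[int] = set()
    if not any(is_verb(t) for t in lab[1:]):  # (2) no verb after first token
        hit.add(2)
    if not 5 <= len(lab) <= 100:              # (3) label too short or long
        hit.add(3)
    return hit


def cluster_filters(label: str, sentences: List[str],
                    spans: List[str]) -> Set[int]:
    """Filters (1), (3)-(6) that depend on the sentences and their spans."""
    lab = words(label)
    sents = [words(s) for s in sentences]
    sps = [words(s) for s in spans]
    hit: Set[int] = set()
    if len(sents) > 4:                                        # (1)
        hit.add(1)
    if any(not 5 <= len(s) <= 100 for s in sents):            # (3)
        hit.add(3)
    if any(len(p) < len(s) / 2 for p, s in zip(sps, sents)):  # (4)
        hit.add(4)
    if sps and len(lab) < min(len(p) for p in sps) / 2:       # (5)
        hit.add(5)
    vocab = {t for s in sents for t in s}
    if any(t not in vocab for t in lab):                      # (6)
        hit.add(6)
    return hit


# Hypothetical verb lexicon, only large enough for the Table 2 labels.
TOY_VERBS = {"urged", "confirmed"}
is_verb = TOY_VERBS.__contains__

table2_labels = [
    "Saudi Arabia urged withdrawal",
    "Resignation of Prime Minister Karami and his government",
    "Murder in Boulder, Colorado",
    "Confirmed bird flu cases in Hong Kong",
]
tags = [label_filters(label, is_verb) for label in table2_labels]
print(tags)
```

Under these assumptions, filter 3 corresponds to the SL (Short Label) tag and filter 2 to the NV (No Verb) tag of Table 2, and the four labels trigger the same filters that the table lists for them.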

3.3 Extending Pyramid Fusion

While Thadani and McKeown (2013) (PyrFus) created a polished dataset, we argue that the skipped clusters would more closely resemble document sentences used in the multi-document summarization task, and learning from such examples is feasible given contemporary models. Additionally, DUC also made available the SCU Marked Corpus Copeck and Szpakowicz (2005), which automatically maps source document sentences to SCU labels using text match. We make use of these mappings to extend our dataset with document sentences, which were overlooked in PyrFus.

We found that certain filters were safe to remove while retaining a successful representation of the sentence intersection task. For instance, we found that the majority of SCU labels that do not have a verb use verb nominalizations, and that labels beginning with a verb are coherent. Table 2 shows a few examples of fusion instances originally filtered in PyrFus that we have kept in our extended version of this dataset, termed PyrFus++. Similarly, the majority of SCU labels of length 4 were also found to be coherent and descriptive of their input sentences, and therefore we decided to discard SCU clusters only if their label is shorter than this threshold. Shorter SCU labels (length 3 and below) were deemed too short to describe a complete summarized sentence (see examples in Appendix §A). Additionally, we allow SCU clusters that have low overlap between their label and their marked contributing spans (see ex. 5 in Table 2). Finally, we keep fusion instances whose SCU label tokens are not fully covered by their input sentences, to allow for paraphrastic examples.

Once we removed most of the filtering pipeline of PyrFus, we noticed that almost 20% of the fusion input clusters share more than one SCU label (i.e. they share more than a single informative proposition). To provide a suitable target label for such instances, we manually relabel these clusters by merging all shared SCU labels into a single sentence. For example, for the following two SCU labels: Clinical trials typically involve three phases and Clinical trials involve an average of 200 patients per trial, a new merged fusion label would be: Clinical trials typically involve three phases and an average of 200 patients per trial.
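The merging itself was done manually, but the clusters that need it can be found automatically by grouping fusion instances on their set of input sentences. The sketch below illustrates that grouping step only. The function names are assumptions, and the two placeholder input sentences for the clinical-trials labels are hypothetical, since the paper does not list them.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Tuple

Instance = Tuple[List[str], str]  # (input sentences, SCU label)


def labels_by_input_set(
    instances: List[Instance],
) -> Dict[FrozenSet[str], List[str]]:
    """Group SCU labels by the (unordered) set of sentences they share."""
    groups: Dict[FrozenSet[str], List[str]] = defaultdict(list)
    for sentences, label in instances:
        groups[frozenset(sentences)].append(label)
    return dict(groups)


def needs_relabeling(
    groups: Dict[FrozenSet[str], List[str]],
) -> Dict[FrozenSet[str], List[str]]:
    """Keep only input sets that carry more than one SCU label."""
    return {k: v for k, v in groups.items() if len(v) > 1}


# Placeholder inputs (hypothetical) for the two labels quoted above.
trial_inputs = ["<summary sentence 1>", "<summary sentence 2>"]
instances: List[Instance] = [
    (trial_inputs, "Clinical trials typically involve three phases"),
    (trial_inputs,
     "Clinical trials involve an average of 200 patients per trial"),
    # Example 5 from Table 2: a cluster with a single SCU label.
    (["Statoil’s internal investigation acknowledged inadequate planning "
      "and a lack of risk appreciation led to the leak.",
      "Statoil admitted the leak resulted from inadequate planning and "
      "appreciation of risk, and failure to observe governing "
      "documentation."],
     "Statoil admitted responsibility for the leak"),
]

shared = needs_relabeling(labels_by_input_set(instances))
for labels in shared.values():
    print("merge manually:", labels)
```

Only the first input set is flagged here, and an annotator would then write the merged label by hand.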

In total, we have extended the fusion dataset from its original 1705 instances to 7502, creating a varied dataset using both reference summary and document sentences. We suggest that the additional instances previously skipped, along with the modifications we introduced, would more closely resemble real-world content fusion challenges in multi-document summarization and related settings.

Fusion Data              Total   Avg Clus.   L-to-S   S-to-S
Lebanoff et al. (2020)   1599    2           32.7     15.0
PyrFus                   1705    2.8         46.5     35.0
-PyrFus                  5842    3.3         34.6     31.6
PyrFus++                 7502    3.3         37.8     32.2
Table 3: Comparisons of different fusion datasets and variations. Lebanoff et al. (2020) introduced a disparate-fusion dataset, containing exactly 2 input sentences from a single document. L-to-S and S-to-S refer to label-to-sentence and sentence-to-sentence ROUGE scores, respectively.

4 Data Analysis

We would like to assess the similarity of the source sentences among themselves and also against their fusion target. To that end, we use the ROUGE metric Lin (2004) as a proxy for similarity and calculate the micro average of ROUGE (measuring word overlap) between every input sentence and its SCU label, and between every pair of sentences in the same cluster. Table 3 compares a few established fusion datasets: the “disparate” fusion corpus (Lebanoff et al., 2020), with (mostly non-overlapping) sentences originating from a single document (see §2), i.e. not summary sentences; PyrFus, the dataset presented in Thadani and McKeown (2013); PyrFus++, our revised and extended dataset presented in this work; and -PyrFus, the part of our dataset containing SCU clusters that were overlooked in PyrFus.
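The two statistics can be sketched as follows. The text does not state which ROUGE variant was used, so this sketch substitutes a plain unigram-overlap F1 for ROUGE, and it reads "micro average" as pooling all pairs across clusters before averaging. Both choices, and all function names, are assumptions, so the output will not match Table 3.

```python
import re
from collections import Counter
from itertools import combinations
from statistics import mean
from typing import List, Tuple

Instance = Tuple[List[str], str]  # (input sentences, SCU label)


def tokens(text: str) -> List[str]:
    return re.findall(r"\w+", text.lower())


def overlap_f1(a: str, b: str) -> float:
    """Unigram-overlap F1 between two texts (a stand-in for ROUGE)."""
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    common = sum((ca & cb).values())  # clipped count of shared tokens
    if common == 0:
        return 0.0
    precision = common / sum(cb.values())
    recall = common / sum(ca.values())
    return 2 * precision * recall / (precision + recall)


def dataset_similarity(instances: List[Instance]) -> Tuple[float, float]:
    """Return (L-to-S, S-to-S), each averaged over all pooled pairs."""
    l_to_s = [overlap_f1(label, s)
              for sents, label in instances for s in sents]
    s_to_s = [overlap_f1(a, b)
              for sents, _ in instances for a, b in combinations(sents, 2)]
    return mean(l_to_s), mean(s_to_s)


# The cluster from Table 1.
sentences = [
    "Fisheries in parts of the Philippines have been decimated by the use "
    "of cyanide in fishing.",
    "Philippine fishermen use cyanide in fishing, needlessly destroying "
    "immature fish.",
    "Sodium cyanide use by fisherman decimates fish.",
    "In the Philippines some fishermen use homemade explosives and cyanide "
    "for driving fish away from reefs and into nets.",
]
label = "Sodium cyanide use by fisherman decimates fish"

l_to_s, s_to_s = dataset_similarity([(sentences, label)])
print(f"L-to-S: {l_to_s:.3f}  S-to-S: {s_to_s:.3f}")
```

A cluster of four sentences contributes four label-to-sentence pairs and six sentence-to-sentence pairs, so larger clusters weigh more heavily in the pooled average.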

We first note that the word overlap among source sentences (S-to-S) is much lower for the disparate fusion set, as expected given the nature of this dataset. This is in contrast to the datasets originating from multi-document sources, indicating that in the former data, the input sentences to fuse share less information. This reinforces our claim that in a true multi-document setting a system will be challenged with significantly more redundant information, and that this has to be specifically addressed.

Unsurprisingly, PyrFus contains a much higher source-to-target word overlap (L-to-S), given that the applied pre-processing explicitly removed instances with less overlap between the SCU label and the source sentences. Adding those instances back to PyrFus++ lowered this metric, making the task more challenging and more realistic. Notably, the difference in sentence-to-sentence similarity (S-to-S) between the original PyrFus and the added portion -PyrFus is only 3.4 ROUGE points. This indicates that the input sentences remain highly related, even after removing the pre-processing pipeline applied originally, making them viable fusion instances.

Overall we note that while some noisy instances were introduced into PyrFus++, the new fusion clusters express high relatedness between the source sentences and their label. This is also suggested by the level of sentence-to-label word-overlap in -PyrFus being roughly the same as the sentence-to-sentence word overlap in the original PyrFus.

5 Baselines

Train Data   Dev    Test   Test++
PyrFus       36.4   40.9   -
PyrFus++     41.5   46.9   32.7
Table 4: Rouge-2 F1 results for the baseline model (BART). Rows are training sets and columns are evaluation sets. Test++ refers to the test set of the extended PyrFus++ dataset. The other evaluation splits refer to the original PyrFus data.

We implement a modern baseline for PyrFus (Thadani and McKeown, 2013), which outperforms their pre-neural one. (The PyrFus evaluation used bigram-F1 (Unno et al., 2006), which is similar to Rouge-2 F1, reporting 24.92 points for their best model. We use the widely accepted Rouge metric to be in line with contemporary works.) To that end, we employ the pre-trained auto-encoder BART Lewis et al. (2020) as our end-to-end generation model due to its demonstrated performance on summarization tasks.

Results, shown in Table 4, were measured with the Rouge-2 F1 metric on the original PyrFus evaluation splits. They clearly show that a fusion model trained on our extended data (PyrFus++) significantly outperforms the same model trained on the original training data, by roughly 6 points. Notably, the model trained on PyrFus++ scored 14 points lower on its own test set, indicating that the new dataset is much more challenging, and yet enables the model to reach better generalizations.

Examining the outputs of both models, we find that most fusion outputs are similar and are often extracted from source sentences. (This characterizes both training sets, since the original Pyramid data contains many extractive SCU labels (Thadani and McKeown, 2013).) Yet, we also notice that the model trained on PyrFus++ tends to select more salient and shared content from the input. For example, using the same input sentences as in Table 1, the produced labels of both models are shown in Table 5, where both are lexically similar to the source sentences and to the gold SCU label. However, the model trained on PyrFus does not include a critical detail that all input sentences discuss, namely fish decimation, while the PyrFus++-trained model correctly includes it. Such instances show the necessity of a large and realistic fusion dataset for model training.

SCU Label Sodium cyanide use by fisherman decimates fish
PyrFus In the Philippines some fishermen use cyanide in fishing
PyrFus++ In the Philippines cyanide use by fisherman decimates fish
Table 5: The gold SCU label vs the predictions made by the baseline model trained on PyrFus and PyrFus++
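To make the difference between the two predictions in Table 5 concrete, the sketch below scores each against the gold SCU label with a plain bigram-overlap F1. It is a simplified stand-in for Rouge-2 F1 (regex tokenization, no stemming), not the official ROUGE package, and the function names are assumptions.

```python
import re
from collections import Counter
from typing import Counter as CounterT, Tuple


def bigrams(text: str) -> CounterT[Tuple[str, str]]:
    """Multiset of adjacent lowercased word pairs."""
    toks = re.findall(r"\w+", text.lower())
    return Counter(zip(toks, toks[1:]))


def bigram_f1(reference: str, prediction: str) -> float:
    """F1 over clipped bigram matches between reference and prediction."""
    ref, pred = bigrams(reference), bigrams(prediction)
    common = sum((ref & pred).values())
    if common == 0:
        return 0.0
    precision = common / sum(pred.values())
    recall = common / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


gold = "Sodium cyanide use by fisherman decimates fish"
pred_pyrfus = "In the Philippines some fishermen use cyanide in fishing"
pred_pyrfus_pp = "In the Philippines cyanide use by fisherman decimates fish"

print(f"PyrFus:   {bigram_f1(gold, pred_pyrfus):.3f}")     # 0.000
print(f"PyrFus++: {bigram_f1(gold, pred_pyrfus_pp):.3f}")  # 0.714
```

The PyrFus prediction shares no bigram with the gold label ("use cyanide" reverses the gold order "cyanide use"), while the PyrFus++ prediction matches 5 of its 8 bigrams against the 6 gold bigrams.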

6 Conclusion

In this work we extended a sentence fusion dataset to more than 3 times its original size, while relabeling some of the data. The new dataset includes more complex and relevant training instances, better reflecting those that could be found in “the wild”, and thus facilitates further research on data consolidation in multi-text tasks. In addition, we train baseline fusion models and show that training on our extended data yields notably better performance on the original fusion test set, while also generating qualitatively better (“loose”) sentence intersections.


  • R. Barzilay and K. McKeown (2005) Sentence fusion for multidocument news summarization. Computational Linguistics 31, pp. 297–328.
  • T. Copeck and S. Szpakowicz (2005) Leveraging pyramids. In Text Summarization Branches Out.
  • M. Elsner and D. Santhanam (2011) Learning to fuse disparate sentences. In Proceedings of the Workshop on Monolingual Text-To-Text Generation, Portland, Oregon, pp. 54–63.
  • K. Filippova and M. Strube (2008) Sentence fusion via dependency graph compression. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, Honolulu, Hawaii, pp. 177–185.
  • M. Geva, E. Malmi, I. Szpektor, and J. Berant (2019) DiscoFuse: a large-scale dataset for discourse-based sentence fusion. In North American Association for Computational Linguistics (NAACL).
  • L. Lebanoff, J. Muchovej, F. Dernoncourt, D. S. Kim, L. Wang, W. Chang, and F. Liu (2020) Understanding points of correspondence between sentences for abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, Online, pp. 191–198.
  • L. Lebanoff, K. Song, F. Dernoncourt, D. S. Kim, S. Kim, W. Chang, and F. Liu (2019) Scoring sentence singletons and pairs for abstractive summarization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 2175–2189.
  • M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer (2020) BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 7871–7880.
  • C. Lin (2004) ROUGE: a package for automatic evaluation of summaries. In Text Summarization Branches Out, Barcelona, Spain, pp. 74–81.
  • E. Marsi and E. Krahmer (2005) Explorations in sentence fusion. In Proceedings of the Tenth European Workshop on Natural Language Generation (ENLG-05).
  • K. McKeown, S. Rosenthal, K. Thadani, and C. Moore (2010) Time-efficient creation of an accurate sentence fusion corpus. pp. 317–320.
  • A. Nenkova and R. Passonneau (2004) Evaluating content selection in summarization: the pyramid method. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004, Boston, Massachusetts, USA, pp. 145–152.
  • K. Thadani and K. McKeown (2011) Towards strict sentence intersection: decoding and evaluation strategies. In Proceedings of the Workshop on Monolingual Text-To-Text Generation, Portland, Oregon, pp. 43–53.
  • K. Thadani and K. McKeown (2013) Supervised sentence fusion with single-stage inference. In Proceedings of the Sixth International Joint Conference on Natural Language Processing, Nagoya, Japan, pp. 1410–1418.
  • Y. Unno, T. Ninomiya, Y. Miyao, and J. Tsujii (2006) Trimming CFG parse trees for sentence compression using machine learning approaches. In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, Sydney, Australia, pp. 850–857.

Appendix A Examples of Filtered SCU Labels

Filtered SCU Labels of Length 3
1 Water being diverted
2 FARC commits slaughters
3 There were floods
4 Adverse reactions reported
Table 6: Examples of SCU Labels of length 3 which were not included in this or previous works.

Appendix B Extending Pyramid-based Fusion Data

For the fusion instances containing summary source sentences as fusion inputs, we used the same years reported in Thadani and McKeown (2013) (years 2005-2007 for DUC and 2008-2011 for TAC). The source document sentences found in Copeck and Szpakowicz (2005) were made available for 2005-2008. We made use of all the years except 2005, since we found that this year contains more varied documents within a topic, which yielded noisier automatic alignments between SCU labels and source document sentences.