Summary Explorer: Visualizing the State of the Art in Text Summarization

08/04/2021, by Shahbaz Syed, et al., Universität Leipzig

This paper introduces Summary Explorer, a new tool to support the manual inspection of text summarization systems by compiling the outputs of 55 state-of-the-art single document summarization approaches on three benchmark datasets, and visually exploring them during a qualitative assessment. The underlying design of the tool considers three well-known summary quality criteria (coverage, faithfulness, and position bias), encapsulated in a guided assessment based on tailored visualizations. The tool complements existing approaches for locally debugging summarization models and improves upon them. The tool is available at https://tldr.webis.de/


1 Introduction

Automatic text summarization is the task of generating a summary of a long text by condensing it to its most important parts. This longstanding task originated in automatically creating abstracts for scientific documents luhn:1958, and was later extended to documents such as web pages salton:1994 and news articles wasson:1998.

There are two paradigms of automatic summarization: extractive and abstractive. The former extracts important information from the to-be-summarized text, while the latter additionally involves paraphrasing, sentence fusion, and natural language generation to create fluent summaries.

Currently, progress in text summarization is tracked primarily via automatic evaluation, with ROUGE lin:2004 as the de facto standard for quantitative evaluation. ROUGE has proven effective for evaluating extractive systems, measuring the overlap of word n-grams between a generated summary and a reference summary (ground truth). Still, it only provides an approximation of a model’s capability to generate summaries that are lexically similar to the ground truth. Moreover, ROUGE is unsuitable for evaluating abstractive summarization systems, mainly due to its inadequacy in capturing all semantically equivalent variants of the reference ng:2015; kryscinski:2019; fabbri:2021. Besides, a reliable automatic evaluation of a summary is challenging lloret:2018 and strongly dependent on its purpose jones:1999.

A robust method to analyze the effectiveness of summarization models is to manually inspect their outputs from individual perspectives such as coverage of key concepts and linguistic quality. However, manual inspection requires obtaining the outputs of the models in question, formulating a guideline that specifies the assessment criteria, and ideally utilizing proper visualization techniques to examine the outputs efficiently.

Figure 1: Overview of Summary Explorer. Its guided assessment process works in four steps: (1) corpus selection, (2) quality aspect selection, (3) model selection, and (4) quality aspect assessment. Exemplified is the assessment of the content coverage of the summaries of four models for a source document from the CNN/DM corpus. For each summary sentence, its two most related source document sentences are highlighted on demand.

To this end, we present Summary Explorer (Figure 1), an online interactive visualization tool that assists humans (researchers, experts, and crowdworkers) in inspecting the outputs of text summarization models in a guided fashion. Specifically, we compile and host the outputs of several state-of-the-art models (currently 55) dedicated to English single-document summarization. These outputs cover three benchmark summarization datasets whose ground-truth summaries range from semi-extractive to highly abstractive. The tool facilitates a guided visual analysis of three important summary quality criteria: coverage, faithfulness, and position bias, where tailored visualizations for each criterion streamline both absolute and relative manual evaluation of summaries. Overall, our use cases (see Section 5) demonstrate the ability of Summary Explorer to provide a comparative exploration of state-of-the-art text summarization models, and to discover interesting cases that are unlikely to be captured by automatic evaluation.

2 Related Work

Leaderboards such as Paperswithcode (https://paperswithcode.com/task/text-summarization), ExplainaBoard (http://explainaboard.nlpedia.ai/leaderboard/task-summ/), and NLPProgress (https://nlpprogress.com/english/summarization.html) provide an overview of the state of the art in text summarization, mainly according to ROUGE. These leaderboards simply aggregate the scores as reported by the models’ developers, where the reported scores can be obtained using different implementations. Hence, a fair comparison becomes less feasible. For instance, the Bottom-Up model gehrmann:2018 uses a different implementation of ROUGE (https://github.com/sebastianGehrmann/rouge-baselines) than the BanditSum model dong:2018 (https://github.com/pltrdy/rouge). Besides, for a qualitative comparison of the models, one needs to manually inspect the generated summaries, which are missing from such leaderboards.

To address these shortcomings, VisSeq wang:2019 aids developers in locally comparing their model’s outputs with the ground truth, providing lexical and semantic comparisons along with statistics such as most frequent n-grams and sentence score distributions. LIT tenney:2020 provides similar functionality for a broader range of NLP tasks, implementing workbench-style debugging of model behavior, including visualization of model attention, confusion matrices, and probability distributions. Most closely related to our work is the recently published SummVis vig:2021, which provides a visual text comparison of summaries with a reference summary as well as the source document, facilitating local debugging of hallucinations in the summaries.

Summary Explorer draws from these developments and adds three missing features: (1) Quality-criteria-driven design. Based on a careful literature review of the qualitative evaluation of summaries, we derive three key quality criteria and encode them explicitly in the interface of our tool; existing tools leave these criteria implicit in their underlying design. (2) A step-by-step process for guided analysis. From the chosen quality criteria, we formulate concise and specific questions needed for a qualitative evaluation, and provide a tailored visualization for each question. While previous tools utilize visualization and enable users to (de)activate certain features, they oblige users to figure out the process themselves, which can be overwhelming for non-experts. (3) Compilation of the state of the art. We collect the outputs of more than 50 models on three benchmark datasets, providing a comprehensive overview of the progress in text summarization.

Summary Explorer complements these tools and also provides direct access to the state of the art in text summarization, encouraging rigorous analysis to support the development of novel models.

3 Designing Visual Summary Exploration

The design of Summary Explorer derives from first principles, namely the three quality criteria coverage, faithfulness, and position bias of a summary in relation to its source document. These high-level criteria are frequently assessed manually throughout the literature. Since their definitions vary, however, we derive from these criteria a total of six specific aspects that are more straightforwardly operationalized in a visual exploration (see Figure 1, Step 2). To render the aspects more directly accessible to users, each is “clarified” by a guiding question that can be answered by a tailored visualization. Below, the three quality criteria are discussed, followed by the visual design.

3.1 Summary Quality Criteria

Coverage

A primary goal of a summary is to capture the important information from its source document. Accordingly, a standard practice in summary evaluation is to assess its coverage of the key content paice:1990; mani:2001; jones:1999. In many cases, a comparison to the ground truth (reference) summary can be seen as a proxy for coverage, which is essentially the core idea of ROUGE. However, since it is hard to establish an ideal reference summary mani:1999, a comparison against the source document is more meaningful. Although an automatic comparison against it is feasible nenkova:2013; shafieibavani:2018, deciding what is important content is highly subjective peyrard:2019. Therefore, authors resort to a manual comparison instead hardy:2019. We operationalize coverage assessment by visualizing a document’s overlap in terms of content, entities, and entity relations with its summary. Content coverage refers to whether a summary condenses information from all important parts of a document, measured by common similarity measures; entity coverage contrasts the sets of named entities identified in both summary and document; and relation coverage does the same, but for extracted entity relations.
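To make the entity coverage aspect concrete, the following minimal Python sketch contrasts the named entities of a document with those of its summary. It assumes spaCy's small English model and a simple case-insensitive match on entity surface forms; it is an illustration, not the tool's actual implementation.

import spacy

nlp = spacy.load("en_core_web_sm")

def entity_coverage(document: str, summary: str) -> float:
    # Fraction of the document's named entities that also appear in the summary.
    doc_ents = {ent.text.lower() for ent in nlp(document).ents}
    sum_ents = {ent.text.lower() for ent in nlp(summary).ents}
    if not doc_ents:
        return 1.0  # nothing to cover
    return len(doc_ents & sum_ents) / len(doc_ents)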

Faithfulness

A more recent criterion that has gained prominence, especially in relation to neural summarization, is the faithfulness of a summary to its source document cao:2018; maynez:2020. Whereas coverage asks whether the document is sufficiently reflected in the summary, faithfulness asks the reverse, namely whether the summary adds something new, questioning its appropriateness. Due to their autoregressive nature, neural summarization models have the unique property of “hallucinating” new content kryscinski:2020; zhao:2020. This is what enables abstractive summarization, but it also bears the risk of generating content in a summary that is unrelated to the source document. Hallucinated content is acceptable only if it is textually entailed by the source document, which renders an automatic assessment challenging falke:2019; durmus:2020. We operationalize faithfulness assessment by visualizing previously unseen words in a summary in context, aligned with the best-matching sentences of its source document.
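As an illustration of how hallucination candidates can be surfaced, the sketch below flags summary words that never occur in the source document. The simple regex tokenization is an assumption made for brevity; the tool's actual pipeline may identify novel words differently.

import re

def novel_words(document: str, summary: str) -> list[str]:
    # Summary words that do not occur anywhere in the document are
    # candidates for hallucinated content and warrant inspection.
    doc_vocab = set(re.findall(r"\w+", document.lower()))
    return [w for w in re.findall(r"\w+", summary.lower())
            if w not in doc_vocab]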

Position bias

Data-driven approaches, such as neural summarization models, can be biased by the domain of their training data and learn to exploit common patterns. For example, news articles are typically structured according to an “inverted pyramid,” where the most important information is given in the first few sentences purdue:2019, which models learn to exploit wasson:1998; kedzie:2018. Non-news texts, such as social media posts, however, do not adopt this structure and thus require an unbiased consideration to obtain proper summaries syed:2019. We operationalize position bias assessment by visualizing the parts of a document that are the source of its summary’s sentences, as well as the parts that are common among a set of summaries.
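For illustration, the following sketch computes the per-sentence counts that such a position bias view could be based on: for each document sentence, how many model summaries contain a sentence whose best match points to it. The unigram-overlap scoring is merely a stand-in for the tool's lexical and semantic alignment.

def overlap(doc_sent: str, sum_sent: str) -> float:
    # Crude unigram overlap as a proxy for sentence alignment strength.
    a, b = set(doc_sent.lower().split()), set(sum_sent.lower().split())
    return len(a & b) / max(len(b), 1)

def position_counts(doc_sents: list[str], summaries: list[list[str]]) -> list[int]:
    # For each document sentence, count the model summaries that contain
    # at least one sentence whose best alignment points to it.
    counts = [0] * len(doc_sents)
    for summary in summaries:
        hits = set()
        for sent in summary:
            best = max(range(len(doc_sents)),
                       key=lambda i: overlap(doc_sents[i], sent))
            hits.add(best)
        for i in hits:
            counts[i] += 1
    return counts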

3.2 Visual Design

Figure 2: (a) Heatmap overview of 45 models for the CNN/DM corpus; those selected for analysis are highlighted in red. Thumbnail views of (b) the content coverage, (c) the entity coverage, (d) the relation coverage, (e) the position bias across models for a single document, and (f) the position bias of a model across all documents. For brevity, only (b) shows the left-hand source document. The thumbnails give a bird’s-eye view and are not meant for close reading.

Guided Assessment

Summary Explorer implements a streamlined process to guide summary quality assessment, consisting of four steps (see Figure 1). (1) A benchmark dataset is selected. (2) A list of available summary quality aspects is offered, each with a preview of its tailored visualization and its interactive use. (3) Applying Shneiderman’s (shneiderman:1996) well-known Visual Information-seeking Mantra (“overview first, zoom and filter, then details-on-demand”), an overview of all models is shown as a heatmap over averages of several quantitative metrics (Figure 2a), which enables a targeted filtering of the models based on their quantitative performance. The heatmap of average values paints only a rough picture; upon model selection, histograms of each model’s score distribution for each metric are available. (4) After models have been selected, the user is forwarded to the corresponding quality aspect’s view.

The visualizations for the individual aspects of the three quality criteria share the property that two texts need to be visually aligned with one another, a visualization paradigm recently surveyed by yousef:2021. Despite this commonality, we abstain from creating a single-view visualization “stuffed” with alternative options, and rather adopt a minimalistic design for the assessment of individual quality aspects.

Coverage View (Figure 2b,c,d)

Content coverage is visualized as alignment of summary sentences and document sentences at the semantic and lexical level in a full-text side-by-side view. Colorization indicates different types of alignments. For entity coverage (relation coverage), a corresponding side-by-side view lists named entities (relations) in a summary and aligns them with named entities (relations) in its source document. For unaligned relations, corresponding document sentences can be retrieved.

Faithfulness View (Figure 3, Case A)

Hallucinations are visualized by highlighting novel words in a summary. For each summary sentence with a hallucination, semantically and lexically similar document sentences are highlighted on demand. Since hallucinated named entities, and thus also entity relations, form a subset of hallucinated words, the above coverage views do the same. Also, in an aggregated view, hallucinations found in multiple summaries are ordered by frequency, allowing the user to inspect a particular model with respect to the types of hallucinations it produces.

Position Bias View (Figure 2e,f)

Position bias is visualized for all models given a source document, and for a specific model with respect to all its summaries in a corpus. The former is visualized as a text heatmap, where a gradient color indicates for every sentence in a source document how many different summaries contain a semantically or lexically corresponding sentence. The latter is visualized by a different kind of heatmap for 50 randomly selected model summaries, where each summary is projected onto a single horizontal bar representing the source document. Bar length reflects document length in sentences, and aligned sentences are colored to reflect lexical or semantic alignment.

Aggregation Options

Most of the above visualizations show an individual source document paired with a summary. This enables the close inspection of a given summary, and thus the manual assessment of a model by sequentially inspecting a number of summaries generated by it for different source documents. For these views, the visualizations also support displaying several summaries from different models for a relative assessment.

4 Collection of Model Outputs

We collected the outputs of 55 summarization approaches on the test sets of three benchmark datasets for the task of single-document summarization: CNN/DM, XSum, and Webis-TLDR-17. Each dataset has a different style of ground-truth summaries, ranging from semi-extractive to highly abstractive, providing a diverse selection of models. Outputs were obtained from NLPProgress, from meta-evaluations such as SummEval fabbri:2021 and REALSumm bhandari:2020, and in correspondence with the models’ developers. (We sincerely thank all the developers for their efforts to reproduce and share their models’ outputs with us.)

4.1 Summarization Corpora

The most popular dataset, CNN/DM hermann:2015; nallapati:2016, contains news articles with multi-sentence summaries that are mostly extractive in nature kryscinski:2019; bommasani:2020. We obtained the outputs of 45 models. While the original test split of the dataset contains 11,493 articles, we discarded those that were not summarized by all models, resulting in a total of 11,448 articles. This minor discrepancy is due to inconsistent usage by authors, such as reshuffling the order of examples, de-duplication of articles in the test set, and different choices of tokenization, text capitalization, and truncation.

For the XSum dataset narayan:2018, the outputs of six models for its test split (10,360 articles) were obtained. XSum contains news articles with more abstractive single-sentence summaries compared to CNN/DM. The Webis-TLDR-17 dataset voelske:2017 contains highly abstractive, self-authored (single to multi-sentence) summaries of Reddit posts, although slightly noisier than the other datasets bommasani:2020. We obtained the outputs from the four submissions of the TL;DR challenge syed:2019 for 250 posts.

Figure 3: Two showcases for identifying inconsistencies in abstractive summaries using Summary Explorer. Case A depicts the verification of the correctness of hallucinations by aligning document sentences. Case B depicts uncovering more subtle hallucination errors by comparing unaligned relations.

4.2 Text Preprocessing

In a preprocessing pipeline, the inputs, namely a collection of documents, their ground-truth summaries, and the summaries generated by a given model, were normalized. First, basic normalization, such as de-tokenization, unifying model-specific sentence delimiters, and sentence segmentation, was carried out. Second, additional information, such as named entities and relations, was extracted using spaCy (https://spacy.io) and Stanford OpenIE angeli:2015, respectively. The latter extracts redundant relations whose partial components, such as the subject or the object, are already captured by longer counterparts. Such “contained” relations are merged into unique representative relations for each subject, as sketched below.
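The following sketch illustrates one way such a merge could be realized, under the simplifying assumption that a triple is redundant if, for the same subject, its relation-plus-object string is a proper substring of a longer triple’s; the tool's actual merging strategy may differ.

from collections import defaultdict

def merge_contained(triples: list[tuple[str, str, str]]) -> list[tuple[str, str, str]]:
    # Group (subject, relation, object) triples by subject and drop those
    # whose relation+object span is contained in a longer span of the group.
    by_subject = defaultdict(list)
    for subj, rel, obj in triples:
        by_subject[subj.lower()].append((subj, rel, obj))
    kept = []
    for group in by_subject.values():
        for subj, rel, obj in group:
            span = f"{rel} {obj}".lower()
            contained = any(span != f"{r} {o}".lower()
                            and span in f"{r} {o}".lower()
                            for _, r, o in group)
            if not contained:
                kept.append((subj, rel, obj))
    return kept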

Alignment

Every output summary is aligned with its source document by identifying the top two lexically and semantically related document sentences for each summary sentence. Lexical alignment relies on averaged ROUGE-1,2,L scores between document and summary sentences. The highest-scoring document sentence is taken as the first match. The second match is identified by removing from the summary sentence all content words already captured by the first match, and repeating the process as per lebanoff:2019. For semantic alignment, the rescaled BERTScore zhang:2020a is computed between a summary sentence and all source document sentences, with the two top-scoring sentences taken as candidates.
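The lexical alignment step can be sketched as follows with the rouge_score package, averaging ROUGE-1/2/L F1 between a summary sentence and every document sentence. The removal of content words for the second match is only approximated here by removing all words covered by the first match; this is a simplification of the procedure of lebanoff:2019, not the tool's exact implementation.

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

def avg_rouge(doc_sent: str, sum_sent: str) -> float:
    # Average F1 over ROUGE-1, ROUGE-2, and ROUGE-L.
    scores = scorer.score(doc_sent, sum_sent)
    return sum(s.fmeasure for s in scores.values()) / len(scores)

def top_two_matches(sum_sent: str, doc_sents: list[str]) -> tuple[int, int]:
    # First match: highest-scoring document sentence.
    first = max(range(len(doc_sents)),
                key=lambda i: avg_rouge(doc_sents[i], sum_sent))
    # Second match: rescore after removing words covered by the first match.
    covered = set(doc_sents[first].lower().split())
    residual = " ".join(w for w in sum_sent.split() if w.lower() not in covered)
    second = max(range(len(doc_sents)),
                 key=lambda i: avg_rouge(doc_sents[i], residual))
    return first, second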

Summary Evaluation Measures

Several standard evaluation measures enable quantitative comparisons and the filtering of models for detailed analysis: (1) compression, the word ratio between a document and its summary grusky:2018; (2) n-gram abstractiveness as per gehrmann:2019, a normalized novelty score that tracks the parts of a summary already among the n-grams it has in common with its document; (3) summary length as word count (not tokens); (4) entity-level factuality as per nan:2021, the percentage of named entities in a summary that are found in its source document; and (5) relation-level factuality, the percentage of relations in a summary that are found in its source document. Finally, for consistency, we recompute ROUGE-1,2,L (https://github.com/google-research/google-research/tree/master/rouge) for all models.
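Two of the simpler measures can be sketched as follows; these are illustrative, not the tool's exact implementations. In particular, the novelty function is only a rough proxy (the fraction of summary n-grams absent from the document), not the normalized score of gehrmann:2019.

def compression(document: str, summary: str) -> float:
    # Word ratio between the document and its summary (grusky:2018).
    return len(document.split()) / max(len(summary.split()), 1)

def novel_ngram_ratio(document: str, summary: str, n: int = 2) -> float:
    # Fraction of summary n-grams that do not occur in the document;
    # a simplified stand-in for n-gram abstractiveness.
    def ngrams(text: str) -> set:
        toks = text.lower().split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    sum_ngrams = ngrams(summary)
    if not sum_ngrams:
        return 0.0
    return len(sum_ngrams - ngrams(document)) / len(sum_ngrams)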

5 Assessment Case Studies

We showcase the use and effectiveness of Summary Explorer by investigating two models (IMPROVE-ABS-NOVELTY and IMPROVE-ABS-NOVELTY-LM) from kryscinski:2018 that improve the abstractiveness of summaries by including more novel phrases. We investigate the correctness of their hallucinations (novel words in the summary) and identify hidden errors introduced by the sentence fusion of the abstractive models.

Hallucinations via Sentence Alignment

Hallucinations are novel words or phrases in a summary that warrant further inspection. Accordingly, our tool highlights them (Figure 3, Case A), directing the user to the respective candidate summary sentences whose related document sentences can be seen on demand. For IMPROVE-ABS-NOVELTY, we see that the first candidate improves abstraction via paraphrasing, is concisely written, and correctly substitutes the term “offenses” with the novel word “charges”. The second candidate also improves abstraction via sentence fusion, where two pieces of information are combined: “bennett allegedly drove her daughter”, and “victim advised she thought she was going to die”. The novel word “told” also fits. However, the sentence fusion creates a wrong relation between the different actors (“bennett allegedly told her daughter that she was going to die”), which can be easily identified via the visual sentence alignment provided.

Hidden Errors via Relation Alignment

The above showcase does not capture all hallucinations. Summary Explorer also aligns relations extracted from a summary and its source document to identify novel relations. For IMPROVE-ABS-NOVELTY-LM, we see that the relation “she was arrested” cannot be aligned with any relation in the source document (Figure 3, Case B). Aligning the summary sentence with the document, we note that it is unfaithful to the source despite avoiding hallucinations (“Bennett was released on $10,500 bail”, not “arrested on $10,500 bail”). The word “arrested” was simply extracted from the document sentence (Figure 3, Case A). Without the visual support, identifying this small but important mistake would have been more cognitively demanding for an assessor.

6 Conclusion

In this paper, we present Summary Explorer, an online interactive visualization tool to assess the state of the art in text summarization in a guided fashion. It enables analysis akin to close and distant reading, in particular facilitating the challenging inspection of hallucinations by abstractive summarization models. The tool is available as open source (https://github.com/webis-de/summary-explorer), enabling local use. We aim to expand the tool’s features in future work, exploring novel visual comparisons of documents to their summaries for more reliable qualitative assessments of summary quality.

7 Ethical Statement

Visualization plays a major role in the usage and accessibility of our tool. To accommodate color blindness, we primarily use gradient-based visuals for key modules such as model selection, aggregation of important content, and text alignment. This renders the tool usable also in a monochromatic setting. Regarding the hosted summarization models, the key goal is to allow a wider audience, comprising model developers, end users, and practitioners, to openly compare and assess the strengths, limitations, and possible ethical biases of these systems. Here, our tool supports making informed decisions about the suitability of certain models for downstream applications.

References