Topic-Aware Abstractive Text Summarization

10/20/2020 ∙ by Chujie Zheng, et al. ∙ University of Maryland, University of Delaware

Automatic text summarization aims at condensing a document to a shorter version while preserving the key information. Different from extractive summarization, which simply selects text fragments from the document, abstractive summarization generates the summary in a word-by-word manner. Most current state-of-the-art (SOTA) abstractive summarization methods are based on the Transformer-based encoder-decoder architecture and focus on novel self-supervised objectives in pre-training. While these models capture the contextual information among words in documents well, little attention has been paid to incorporating global semantics to better fine-tune for the downstream abstractive summarization task. In this study, we propose a topic-aware abstractive summarization (TAAS) framework that leverages the underlying semantic structure of documents represented by their latent topics. Specifically, TAAS seamlessly incorporates a neural topic model into an encoder-decoder based sequence generation procedure via attention for summarization. This design learns and preserves the global semantics of documents and thus makes summarization effective, as demonstrated by our experiments on real-world datasets. Compared to several cutting-edge baseline methods, we show that TAAS outperforms BART, a well-recognized SOTA model, by 1.9%, 7.9%, and 12% in terms of the F1 score of ROUGE-1, ROUGE-2, and ROUGE-L, respectively. TAAS also achieves performance comparable to PEGASUS and ProphetNet, which is notable given that training PEGASUS and ProphetNet requires enormous computing capacity beyond what we used in this study.

1. Introduction

In today's digital economy, we face a tremendous amount of information every day, which often leads to information overload and poses great challenges to efficient information consumption. As shown in Figure 1, summarization enables a quick and condensed overview of the content and has been used in various applications to help users navigate the ocean of content. Summarization has been a widely studied topic in Natural Language Processing (NLP), where a short and coherent snippet is automatically generated from a longer text. An accurate and concise summary is critical to many downstream tasks, such as information retrieval and recommender systems. As illustrated by (gu2020generating), automatic summarization by algorithms can reduce reading time, make users' selection process easier, improve the effectiveness of indexing, be less biased than human summaries, and increase the number of texts consumers are able to process.

Researchers have developed various summarization techniques that primarily fall into two categories: extractive summarization and abstractive summarization. Extractive summarization selects phrases and sentences from the source document to compose the summary. It ranks the relevance of phrases in order to choose only those most relevant to the meaning of the source (narayan2018ranking; zhou2018neural; liu2019text), and it does not modify any words. On the contrary, abstractive summarization generates entirely new phrases and sentences in a word-by-word manner to capture the meaning of the source text (lewis2019bart; zhang2019pegasus; yan2020prophetnet). This is a more challenging direction but is consistent with how humans summarize, and it holds the promise of more general solutions to the task. Thus, the present study focuses on abstractive summarization.

Figure 1. An example of text summarization

Recently, deep learning models have shown promising results in many domains. Inspired by the successful application of deep learning to machine translation, abstractive text summarization is typically framed as a sequence-to-sequence learning task. Therefore, various sequence models, especially the game-changing Transformer-based encoder-decoder framework, can be applied. The Transformer has been used in a wide range of downstream applications, such as reading comprehension, question answering, and natural language inference, and has achieved astonishing performance. For example, the recent Transformer-based GPT-3 is considered the largest language model so far, with a whopping 175 billion parameters, and can produce impressive results on various tasks with zero- or few-shot learning. Similarly, most current state-of-the-art (SOTA) abstractive summarization methods also focus on novel self-supervised objectives in pre-training (lewis2019bart; yan2020prophetnet; zhang2019pegasus). With the contribution of the attention mechanism (vaswani2017attention), Transformer-based models capture the syntactical and contextual information among words in documents well. However, little attention has been paid to incorporating semantic information at a global level to better fine-tune for the abstractive summarization task. In particular, the latent topics in documents should play a role in text summarization, since the generated summary is expected to capture the key information of the source text. Many topic modeling methods (griffiths2004hierarchical; chemudugunta2008combining; li2006pachinko) have been proposed to discover the latent semantic structures of a collection of documents. Each topic is a distribution over words from the documents, with each word having a probability of belonging to a specific topic. The intuition behind our study is that by leveraging the topic association of each word in the document, our model can assign more weight to words that are more likely to represent the key topics of the document and thus generate a better summary.

Like prior studies, we also adopt a sequence-to-sequence model to generate summaries. A key component of a seq2seq model is how to represent and encode the source text. Current approaches include a sum-up approach and a self-attention approach. The sum-up approach aggregates all latent representations of an input sequence into one latent representation for decoding, and recent years have witnessed the prosperity of this approach in sequence modeling. However, it has three shortcomings: (1) it easily creates spurious dependencies due to the overly strong assumption that any adjacent interactions in a sequence must be dependent, which may not hold in practice because a sequence might contain noisy information; (2) it tends to capture only point-wise dependencies while collective/group dependencies are largely ignored; (3) the importance of individual inputs in a sequence is likely unequal, which has inspired much research on attention-based sequence modeling. Among these, self-attention in Transformer-style architectures is the most common and well-developed; it captures both short-term and long-term dependencies among inputs well (see (devlin2018bert) for details). These approaches focus on capturing contextual information at a syntactical level, while the semantics are overlooked, which might significantly reduce sequence modeling performance, especially for the summarization task.

Motivated by this, we take the semantic structure of an input document into consideration, in particular its latent topics. This overcomes the limitation of focusing only on local contextual information while high-level semantics are neglected. Therefore, in this paper we bring this idea to the summarization task and propose a new framework named Topic-Aware Abstractive Summarization (TAAS). We believe this design can help identify informative words for a more comprehensive representation of the input sequence (yang2020sdtm), which leads to a better summary. Our empirical experiments also demonstrate its effectiveness compared to existing methods. Overall, the main contributions of this paper are three-fold:

  • We propose a new framework for abstractive summarization with topic information incorporated, which helps to capture the semantic information and provides guidance during generation to preserve the key information. This generic framework opens a new perspective in NLP and can be extended to other language tasks.

  • We implement topic-aware attention using topic-level features through neural topic modeling and a Transformer-based encoder-decoder, which efficiently extracts salient topics and captures the long-term dependency and informativeness of words in the input sequence.

  • We conduct experiments on a real-world dataset and compare the performance of our model with several state-of-the-art approaches to demonstrate its effectiveness on the summarization task. We also discuss how important hyperparameters and different types of data affect performance.

2. Related Work

Three lines of research are closely related to our paper: attention mechanism, text summarization, and topic modeling.

The attention mechanism has brought NLP to a new stage since its inception in (vaswani2017attention). At present, various models based on the attention mechanism have achieved breakthrough performance in many NLP tasks. The idea of the attention mechanism is inspired by human visual attention, which allows us to focus on a certain region with "high resolution." Attention can be interpreted as a vector of importance weights. To predict or infer one element, such as a pixel in an image or a word in a sentence, we use an attention vector to estimate how strongly it is correlated with other elements. Elements highly relevant to the target should be assigned higher weights, while irrelevant ones should receive lower weights. Using the language generation task as an example, the attention mechanism provides guidance on how words contribute to the sequence generation, which is common in many NLP applications (devlin2018bert; radford2019language).

Text summarization is a widely studied topic in NLP. It aims to distill the input document into a short and concise summary. There are two types of summarization approaches: extractive and abstractive. Early extractive methods formulate the problem as selecting a subset of sentences to capture the main idea of the input document, using handcrafted features and graph-based structural information (narayan2018ranking; zhou2018neural; liu2019text; gu2020generating). With the advancement of seq2seq models, the encoder-decoder network has shown promising ability to generate abstractive summaries. In this framework, the encoder obtains a comprehensive representation of the input sequence, and the decoder generates the output summary based on the latent representation. Recurrent neural networks (RNNs) and Transformers are commonly adopted in the encoder-decoder network. In particular, the Transformer together with the attention mechanism has become the state-of-the-art standard in both academia and industry, and several variants have achieved promising results in the summarization task. For example, BART (lewis2019bart) is very effective for text generation tasks; it implements a bidirectional encoder and a left-to-right autoregressive decoder. PEGASUS (zhang2019pegasus) and ProphetNet (yan2020prophetnet) introduce different pre-training objectives for text summarization. PEGASUS masks important sentences and generates those gap-sentences from the rest of the document as an additional pre-training objective. ProphetNet predicts the next several tokens simultaneously at each time step, which encourages the model to plan for future tokens. One common drawback of these Transformer-based summarization models is that higher-level global semantic structure in the text, such as latent topics, is usually ignored, which motivates our study that designs topic-aware attention for summarization.

The topic model is an important component of TAAS. It discovers semantically relevant terms that form coherent topics via probabilistic generative models (meng2020discriminative) in an unsupervised manner. One basic assumption shared by various topic models is that a document is a mixture of topics and each topic is a distribution over words in the vocabulary of the corpus. To learn these distributions, Latent Dirichlet Allocation (LDA) imposes latent variables with a Dirichlet prior (griffiths2004hierarchical). Recently, the development of deep generative networks and stochastic variational inference has enabled neural network-based topic modeling, which has proven effective (miao2017discovering). In particular, auto-encoding variational Bayes provides a generic framework for deep generative topic modeling, especially the variational auto-encoder (VAE) that consists of a generative network and an inference network (yang2020graph). This framework serves as an important foundation for many studies in this field. For example, the Neural Variational Document Model (NVDM) applies a VAE for unsupervised document modeling with a bag-of-words document representation (miao2016neural). The Gaussian Softmax Model (GSM) extends NVDM by constructing the topic distribution with a softmax function applied to the projection of Gaussian random vectors (miao2017discovering). In recent years, topic modeling has also been extended to other NLP tasks, including text summarization (mao2019facet). These works differ from ours in that we develop topic attention via neural topic modeling and a seq2seq model for text summarization.

3. Preliminaries

In this section, we first formally define the summarization task with key notations and concepts explained. We then briefly review sequence-to-sequence Transformer-based architecture, which our TAAS is built upon. Notations throughout the paper are listed in Table 1.

Symbol | Representation
x | Input sequence
h | Output from the hidden state for x
y | Output sequence
s | Latent representation for x
β | Topic-word distribution
W_t | Topic embedding
a_{ij} | Topic attention of token x_i under topic t_j
a | Topic attention for x
Table 1. Notations

3.1. Problem Definition

Automatic text summarization aims at condensing a document to a shorter version while preserving the key information. Let x = (x_1, x_2, ..., x_n) be an input document with n tokens, where x_i is the word embedding of the i-th token. Given x, our TAAS model learns a function f that maps x to another sequence of tokens y = (y_1, y_2, ..., y_m), where y is the summary with m tokens. This automatic generation process is achieved by maximizing the probability p(y | x; Θ) via the beam search algorithm. f is usually implemented by neural networks or a Transformer parameterized by Θ.
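To make this generation procedure concrete, the following sketch shows how a fine-tuned seq2seq checkpoint can produce a summary y from a document x via beam search, using the HuggingFace transformers API and the distilled BART checkpoint mentioned later in Section 5. The generation hyperparameters (beam width, maximum length) are illustrative assumptions, not the paper's exact settings.

```python
# Hedged sketch: summarizing one document with beam search.
# The checkpoint matches the one used later in this paper; the generation
# hyperparameters below are illustrative assumptions.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("sshleifer/distilbart-cnn-12-6")
model = AutoModelForSeq2SeqLM.from_pretrained("sshleifer/distilbart-cnn-12-6")

document = "Bob Barker returned to host 'The Price Is Right' on Wednesday. ..."
inputs = tokenizer(document, return_tensors="pt", truncation=True, max_length=1024)

# Beam search approximately maximizes p(y | x; Theta) over candidate summaries y.
summary_ids = model.generate(
    inputs["input_ids"],
    num_beams=4,
    max_length=142,
    early_stopping=True,
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```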

3.2. Seq2seq Transformer Architecture

The key idea behind the sequence-to-sequence model is to represent an input sequence as a low-dimensional vector while preserving the contextual information in the sequence as much as possible, upon which a new task-specific sequence of arbitrary length can be automatically generated. In practice, it usually consists of an encoder and a decoder, where the encoder encodes key information from the input sequence x and generates a contextualized representation s, which is the input to the decoder. Taking a single encoder layer as an example, given an input sequence x = (x_1, ..., x_n), we can obtain h = (h_1, ..., h_n), where each h_i is the hidden state from that encoder layer for the input x_i.

Given that h_i is the learned representation for input token x_i, one typical approach to obtain a sequence-level representation s is to sum up all word-level representations, as shown in Eq. 1. Note that this strategy ignores the complicated relationships among the token-level latent representations h_i.

s = \sum_{i=1}^{n} h_i    (1)

Different approaches to obtaining a sequence-level representation have been proposed. For example, (devlin2018bert) adds a special token "<CLS>" at the beginning of a sequence, and the final hidden state produced by the encoder for this token is used as the aggregate sequence representation. We adopt this approach in this paper.

To transform x to h, we leverage the sequence-to-sequence Transformer network. Given an input sequence x, the Transformer calculates multi-head self-attention to map the variable-length sequence of symbol representations x to another sequence of equal length h = (h_1, ..., h_n), with h_i ∈ R^{d_h} (vaswani2017attention), where d_h is the hidden size. Using single-head attention as an example, the Transformer first multiplies the input sequence by W^Q, W^K, and W^V to obtain the query matrix Q, the key matrix K, and the value matrix V, as in Eq. 2.

Q = x W^Q, \quad K = x W^K, \quad V = x W^V    (2)

Then, the attention is calculated according to Eq. 3.

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left( \frac{Q K^\top}{\sqrt{d_k}} \right) V    (3)

Thus, the latent representation s of an input sequence can be written as follows, which is the output of the hidden state for the first token "<CLS>".

s = h_1    (4)
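As a concrete illustration of Eqs. 2-4, the minimal sketch below computes single-head scaled dot-product attention over a toy input and takes the hidden state of the leading "<CLS>" token as the sequence representation. The dimensions and random weights are illustrative only.

```python
# Sketch of single-head scaled dot-product attention (Eqs. 2-3) and the
# "<CLS>"-based sequence representation (Eq. 4). Sizes are toy values.
import torch
import torch.nn.functional as F

def single_head_attention(x, W_q, W_k, W_v):
    # x: (n, d_h) token representations; W_*: (d_h, d_h) projection matrices
    Q, K, V = x @ W_q, x @ W_k, x @ W_v               # Eq. 2
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5     # (n, n) attention scores
    return F.softmax(scores, dim=-1) @ V              # Eq. 3: (n, d_h)

n, d_h = 6, 8
x = torch.randn(n, d_h)                               # toy input sequence
W_q, W_k, W_v = (torch.randn(d_h, d_h) for _ in range(3))
h = single_head_attention(x, W_q, W_k, W_v)
s = h[0]   # Eq. 4: hidden state of the first ("<CLS>") token as the sequence representation
```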

4. Topic-Aware Abstractive Summarization (TAAS)

In this section, we describe the details of our proposed model TAAS, which considers both syntactical and semantic structures of text for abstractive summarization. As illustrated in Fig. 2, TAAS consists of three major components:

Figure 2. Overview of our proposed Topic-Aware Abstractive Summarization model (TAAS)

(A) Neural topic modeling. This is a deep learning-based topic model in which a variational autoencoder (VAE) is implemented to learn the latent topic vectors t (document-topic distribution) and β (topic-word distribution) via two networks (i.e., encoder and decoder) in a generative manner.

(B) Topic-aware attention. To incorporate the latent structure of documents at a semantic level into a subsequent seq2seq-based summarization model, we introduce topic-aware attention to understand the impact of words on the summarization. We believe that such a design can help our model capture global information by leveraging the topic features learned from the neural topic modeling.

(C) Encoder-decoder-based sequence modeling. The Transformer-based encoder-decoder framework is employed to capture complicated syntactical features in the text. The hidden states generated by the encoder, along with the topic attention, are used to calculate a latent representation in which both contextual and global information is captured. This representation is the input to the decoder for generating the output summary.

4.1. Neural Topic Model

Our topic-weighted attention is built on extracting latent topic information through a neural topic model (NTM) (miao2017discovering; srivastava2017autoencoding). NTM is based on the variational auto-encoder (VAE) (kingma2013auto) and involves a continuous latent variable t representing the latent topics.

Given an input document d containing N words, the latent topic variable t ∈ R^K corresponds to the topic proportion of document d, where K denotes the number of topics, and z_n is the topic assignment for the observed word w_n. NTM implements a VAE to learn latent topic vectors via two networks: an inference network and a generative network. The inference network (encoder) is a compressor that transforms the input text data into a latent representation, i.e., a latent topic vector, while the generative network (decoder) is a reverter that reconstructs the latent representation back to the original input. Such a design of using neural networks to parameterize the multinomial topic distribution eliminates the need to predefine distributions to guide the generative process; it only requires specifying a simple prior (e.g., a diagonal Gaussian) (miao2017discovering). Therefore, the overall generative process for document d can be written as Eq. 5:

t \sim \mathcal{N}(\mu_0, \sigma_0^2), \quad z_n \sim \mathrm{Multi}(\mathrm{softmax}(t)), \quad w_n \sim \mathrm{Multi}(\beta_{z_n}), \quad n = 1, \dots, N    (5)

Here we pass a diagonal Gaussian distribution with mean μ_0 and variance σ_0^2 to parameterize the multinomial document-topic distribution (miao2017discovering) and build an unbiased gradient estimator for the variational distribution (yang2020sdtm). μ_0 and σ_0 are trainable parameters, and all parameters involved in this generative process constitute the parameters of the generative network. The inference network approximates the true posterior p(t | d) using a diagonal Gaussian q(t | d) = N(t | μ(d), σ^2(d)). We use three fully connected networks f_e, f_μ, and f_σ to represent μ and σ as μ(d) = f_μ(f_e(d)) and σ(d) = f_σ(f_e(d)).

Overall, we use variational inference (blei2017variational) to approximate the posterior distribution over t. The loss function is defined as the negative of the variational lower bound (ELBO) (miao2017discovering). In Eq. 6, q(t | d) and p(d | t) are the probabilities for the encoding and decoding processes, respectively, and p(t) is a standard Normal prior N(0, I).

L_{NTM} = -\mathbb{E}_{q(t \mid d)}[\log p(d \mid t)] + D_{KL}\big(q(t \mid d) \,\|\, p(t)\big)    (6)
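The following is a minimal sketch of such an NTM, assuming a standard VAE parameterization: a shared encoder with two heads producing the mean and log-variance of the diagonal Gaussian, the reparameterization trick for sampling t, and the negative ELBO of Eq. 6 as the loss. The layer sizes and the softmax-based reconstruction are assumptions in the spirit of NVDM/GSM, not the authors' exact architecture.

```python
# Hedged sketch of a VAE-style neural topic model over bag-of-words inputs.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeuralTopicModel(nn.Module):
    def __init__(self, vocab_size, num_topics, hidden=256):
        super().__init__()
        self.f_e = nn.Sequential(nn.Linear(vocab_size, hidden), nn.ReLU())  # shared encoder f_e
        self.f_mu = nn.Linear(hidden, num_topics)                           # mean head f_mu
        self.f_sigma = nn.Linear(hidden, num_topics)                        # log-variance head f_sigma
        self.beta_logits = nn.Parameter(torch.randn(num_topics, vocab_size))  # topic-word logits

    def forward(self, bow):                                  # bow: (batch, vocab_size)
        e = self.f_e(bow)
        mu, logvar = self.f_mu(e), self.f_sigma(e)
        t = mu + torch.randn_like(mu) * (0.5 * logvar).exp() # reparameterized topic vector t
        theta = F.softmax(t, dim=-1)                         # document-topic distribution
        beta = F.softmax(self.beta_logits, dim=-1)           # topic-word distribution beta
        word_dist = theta @ beta                             # p(w | t): (batch, vocab_size)
        recon = -(bow * (word_dist + 1e-10).log()).sum(-1)   # -E_q[log p(d | t)]
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1)  # KL(q(t|d) || N(0, I))
        return (recon + kl).mean(), beta                     # negative ELBO (Eq. 6), beta

# Example usage on random bag-of-words counts (toy sizes):
loss, beta = NeuralTopicModel(vocab_size=5000, num_topics=10)(torch.rand(4, 5000))
```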

4.2. Topic-Aware Attention

To incorporate the document-level semantics embedded in latent topics into the encoder-decoder seq2seq model, TAAS introduces a topic-aware attention mechanism through which the two components are connected to enrich the representation of the input sequence for better summarization. As mentioned above, this is very different from prior studies that develop either a sum-up or a self-attention approach to represent the input document. That past research emphasizes contextual information at a syntactical level, while the semantics are neglected, which can deteriorate sequence modeling performance, especially for summarization. This motivates us to design a topic-aware attention-based approach (see Figure 3). Specifically, it is designed as follows.

Figure 3. Architecture of topic attention

From the NTM, we obtain a topic-word distribution β ∈ R^{K×V}, where K and V denote the number of topics and the vocabulary size, respectively. Given a topic vector β_j and the hidden states of an input sequence, we calculate the attention weight a as:

a_{ij} = h_i^\top \beta_j    (7)

where a_{ij} is the attention weight for token x_i under topic t_j, h_i is the output of the last hidden layer of the encoder network, and β_j ∈ R^V. However, this simple attention design has two drawbacks: (1) the trained model cannot generalize to unseen documents that contain words never appearing in the training vocabulary; (2) the dimensionality of β_j and h_i mismatches, because V is usually much larger than the hidden size d_h. To overcome these limitations, we add a transformation component (i.e., mapping β to W_t) realized by a fully connected feed-forward network with a residual connection (he2016deep) followed by layer normalization (ba2016layer):

W_t = \mathrm{LayerNorm}\big(\tilde{\beta} + \mathrm{FFN}(\tilde{\beta})\big), \quad \tilde{\beta} = \beta W_p + b_p    (8)

The resulting W_t ∈ R^{K×d_h} saves us from being confined to a pre-defined vocabulary, where d_h is the hidden size of the encoder. Each row W_{t_j} is considered as the topic embedding, which carries the latent features and information of the j-th topic. Now the attention weight a can be rewritten as:

a_{ij} = h_i^\top W_{t_j}    (9)

Further, for every token x_i, we average the attention weights over the K topics to get a topic-aware attention weight \hat{a}_i as:

\hat{a}_i = \frac{1}{K} \sum_{j=1}^{K} a_{ij}    (10)

This weight is then normalized via softmax to obtain our final attention a_i:

a_i = \frac{\exp(\hat{a}_i)}{\sum_{k=1}^{n} \exp(\hat{a}_k)}    (11)

We denote this final topic attention as a = (a_1, ..., a_n).
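A minimal sketch of this topic-aware attention is given below, under the assumption that β is first projected to the encoder hidden size before the residual feed-forward transformation of Eq. 8; module names and sizes are illustrative rather than the authors' exact implementation.

```python
# Hedged sketch of the topic-aware attention of Eqs. 8-11.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopicAwareAttention(nn.Module):
    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.proj = nn.Linear(vocab_size, hidden_size)    # maps beta_j in R^V to R^{d_h}
        self.ffn = nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.ReLU(),
                                 nn.Linear(hidden_size, hidden_size))
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, h, beta):
        # h: (n, d_h) encoder hidden states; beta: (K, V) topic-word distribution
        b = self.proj(beta)
        W_t = self.norm(b + self.ffn(b))                  # Eq. 8: FFN + residual + LayerNorm
        scores = h @ W_t.transpose(0, 1)                  # Eq. 9: a_{ij} = h_i . W_{t_j}, shape (n, K)
        a_hat = scores.mean(dim=-1)                       # Eq. 10: average over K topics, shape (n,)
        return F.softmax(a_hat, dim=-1)                   # Eq. 11: final topic attention a

# Example usage with 12 tokens, 10 topics (toy sizes):
attn = TopicAwareAttention(vocab_size=5000, hidden_size=768)
a = attn(torch.randn(12, 768), torch.rand(10, 5000))
```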

4.3. Encoder-Decoder Sequence Modeling

Our final sequence-to-sequence model uses a standard encoder-decoder framework, where the encoder generates a contextualized latent representation s of the input sequence x, which is the input to the decoder. The decoder then outputs a summary in a sequential manner. Specifically, the input sequence x is fed to the first layer of the encoder, where a hidden representation h^{(1)} is generated. For the remaining layers of the encoder, the output of the previous layer serves as the input to the current layer. The final state of the encoder, along with the topic attention, is used to generate the latent representation s, which serves as the initial hidden state for the decoder.

In previous works, the recurrent neural network (RNN) and self-attention in the Transformer are the two most widely used architectures for the encoder-decoder network. One major weakness of the RNN-based approach is that the contextualized representation s has only a short-term impact on the generated sequence, because it is used only at the beginning of the generation process. To address this challenge, the attention mechanism is introduced to take the entire encoder context into account (jurafsky2014speech). As illustrated in Eq. 12, s is available during decoding by conditioning the current decoder state on it, where g is a stand-in for the attention calculation and y_{t-1} is the word embedding of the output sampled from the softmax at the previous step.

h_t^{dec} = g\big(h_{t-1}^{dec}, y_{t-1}, s\big)    (12)

Motivated by the great success of Transformer-based models such as BERT (devlin2018bert) in recent years, we employ this sequence-to-sequence Transformer architecture for our abstractive summarization task. Given the topic attention a and the hidden states h (i.e., the encoder output with the layer superscript omitted), the latent representation s is calculated as:

s = \sum_{i=1}^{n} a_i h_i    (13)

Parameter estimation. TAAS consists of two objectives, from the NTM and from the encoder-decoder sequence modeling, which are jointly optimized. The objective function of TAAS is defined in Eq. 14, where L_{NTM} is the negative ELBO defined in Eq. 6 and L_{seq} is the cross-entropy loss between the predicted output of the decoder and the true summary. λ is a hyper-parameter that balances the importance of the NTM and the encoder-decoder. The overall process of TAAS is sketched in Algorithm 1.

L = L_{seq} + \lambda L_{NTM}    (14)
0:  Input: input sequence x
0:  Output: summary y
1:  # training phase
2:  for each epoch do
3:     for each batch do
4:        # each batch
5:        β ← NTM(x); # topic-word distribution
6:        W_t ← transform(β) using Eq. 8; # topic embedding
7:        h ← Encoder(x); # hidden states from encoder
8:        a_{ij} ← h_i^T W_{t_j}; # per-topic attention weights (Eq. 9)
9:        calculate a using Eq. 10 and Eq. 11; # topic attention
10:        obtain latent representation s using Eq. 13;
11:        y ← Decoder(s, h); # sequence generation - summary
12:        update parameters of TAAS: Θ ← Θ − η ∇_Θ L;
13:        # η: learning rate, L: objective function (see Eq. 14)
14:     end for
15:  end for
16:  # test phase
17:  repeat steps 5 – 11 for x and the learned parameters Θ.
Algorithm 1 TAAS: topic-aware abstractive summarization
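To connect the pieces, the sketch below implements one training step of Algorithm 1 under the joint objective of Eq. 14. It reuses the NeuralTopicModel and TopicAwareAttention sketches above; `encoder`, `decoder`, `optimizer`, and the batch fields are stand-ins for the Transformer components and data pipeline rather than the authors' exact implementation, and a single document per step is assumed for brevity.

```python
# Hedged sketch of one TAAS training step (Algorithm 1, Eq. 14).
import torch

def train_step(batch, ntm, topic_attn, encoder, decoder, optimizer, lam=0.0):
    l_ntm, beta = ntm(batch["bow"])              # NTM loss (Eq. 6) and topic-word distribution
    h = encoder(batch["input_ids"])              # (n, d_h) encoder hidden states
    a = topic_attn(h, beta)                      # topic attention (Eqs. 10-11), shape (n,)
    s = (a.unsqueeze(-1) * h).sum(dim=0)         # latent representation s (Eq. 13)
    l_seq = decoder(s, h, batch["summary_ids"])  # stand-in: cross-entropy vs. the true summary
    loss = l_seq + lam * l_ntm                   # joint objective (Eq. 14)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                             # Theta <- Theta - eta * grad
    return loss.item()
```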

5. Experiments

In this section, we first describe the dataset, evaluation metrics, and parameter settings. Then, we compare TAAS against state-of-the-art text summarization models through several experiments. Parameter sensitivity and model ablation analyses are also discussed. Code is publicly available at https://github.com/taas-www21/taas.

5.1. Experimental Settings

Dataset. The data we use in this study is the CNN/Daily Mail (CNN/DM) dataset (see2017get), which contains 93k news articles from CNN and 220k articles from Daily Mail. The articles have 781 tokens on average and are paired with multi-sentence summaries (3.75 sentences or 56 tokens on average), which serve as the ground truth for our summarization task. The training, validation, and test sets include 287,113, 13,368, and 11,490 article-summary pairs, respectively.

Evaluation Metrics. Following existing work (lin2004rouge), we use ROUGE for summarization performance evaluation. ROUGE measures the overlap between the generated summary and the ground-truth summary. ROUGE-N and ROUGE-L are the most commonly used ROUGE metrics in practice; they are based on N-gram overlap and the longest common subsequence (LCS), respectively. Recall in the context of ROUGE measures how much of the ground-truth summary is covered by the generated summary, whereas precision measures how much of the generated summary is actually contained in the ground-truth summary. In this study, we report the F1 score of ROUGE-1, ROUGE-2, and ROUGE-L for every experiment.
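As an illustration of the metric, the snippet below computes ROUGE-1/2/L F1 for a toy pair of summaries using the `rouge-score` package; the choice of package is an assumption, since the paper does not state which ROUGE implementation it uses.

```python
# Hedged example: ROUGE-1/2/L F1 with the rouge-score package (assumed tooling).
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
reference = "Bob Barker returned to host The Price Is Right on Wednesday."
generated = "Bob Barker returns to hosting The Price Is Right for the first time in eight years."
scores = scorer.score(reference, generated)
print({name: round(score.fmeasure, 4) for name, score in scores.items()})
```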


Parameter Setting. Our experiments are conducted on a machine with two GeForce RTX 2080 Ti GPUs. Due to memory limitations, the batch size for training and testing is set to 32. Following the suggestion by HuggingFace (see their seq2seq examples at https://github.com/huggingface/transformers/tree/master/examples/seq2seq), we freeze the parameters of the encoder and the token embedding and fine-tune only the decoder.
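A sketch of this freezing strategy with the HuggingFace API is shown below; the attribute accessors follow the standard BART layout, and treating this as the authors' exact implementation would be an assumption.

```python
# Hedged sketch: freeze encoder and token embeddings, fine-tune only the decoder.
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("sshleifer/distilbart-cnn-12-6")

for p in model.get_encoder().parameters():            # freeze the encoder
    p.requires_grad = False
for p in model.get_input_embeddings().parameters():   # freeze the shared token embeddings
    p.requires_grad = False

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(f"{len(trainable)} parameter tensors remain trainable (decoder side).")
```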

Model | ROUGE-1 | ROUGE-2 | ROUGE-L
Rule-based: Lead-3 | 40.07 | 17.68 | 36.33
Extractive: REFRESH | 40.00 | 18.20 | 36.60
Extractive: SUMMARUNNER | 39.60 | 16.20 | 35.30
Abstractive: DRM | 39.87 | 15.82 | 36.90
Abstractive: PGN | 39.53 | 17.28 | 36.38
Abstractive: T5 | 37.88 | 15.49 | 32.22
Abstractive: BART | 40.41 | 17.79 | 33.86
Our method: TAAS | 41.19 | 19.20 | 37.94
(improvement over second best) | (1.9%) | (7.9%) | (2.8%)
Industry SOTA: ProphetNet | 41.45 | 20.03 | 38.75
Industry SOTA: PEGASUS | 44.06 | 21.36 | 41.17
Table 2. Performance comparison on the CNN/DM dataset (F1 scores of ROUGE).

Benchmark Methods. To evaluate whether the topic-aware attention-based design is effective for text summarization, we compare TAAS with the following methods, which do not incorporate the semantic structure of documents and can be grouped into rule-based, extractive, and abstractive categories.

  • Lead-3 is a simple rule-based method that chooses the first three sentences from a document as its summary.

The following are extractive summarization methods.

  • SUMMARUNNER (nallapati2016summarunner) is a two-layer bi-directional GRU-RNN based sequence model for extractive summarization. It formulates summarization as a sequence classification problem: for each sentence, a binary classifier decides whether it is included in the summary.

  • REFRESH (narayan2018ranking) formulates extractive summarization as a ranking task over all sentences of an input document. It uses an LSTM to select sentences from the input document.

The following are abstractive summarization methods.

  • DRM (paulus2017deep) introduces a neural network model with a novel intra-attention that attends over the input and the continuously generated output separately. The model reads the input sequence with a bi-directional LSTM encoder and uses a single LSTM decoder to generate the summary.

  • Pointer-Generator Network (PGN) (see2017get) constructs a pointer-generator network for summarization, which copies words from the source text to aid accurate reproduction of information while retaining the ability to produce novel words through the generator. This framework can be viewed as a balance between the extractive and abstractive approaches.

  • T5 (raffel2020exploring) is a unified framework that converts every language problem into a text-to-text format. It is pretrained on a large language corpus, and the framework can be adapted to different language tasks, including summarization.

  • BART (lewis2019bart) employs the bidirectional encoder to enhance the sequence understanding and the left-to-right decoder to generate the summary.

  • ProphetNet (yan2020prophetnet) predicts the next several tokens simultaneously based on previous context tokens at each time step. This design encourages the model to plan for the future generation process.

  • PEGASUS (zhang2019pegasus) introduces a new pre-training objective that masks not only tokens but also some important sentences, enabling the model to capture global information among sentences and to generate candidate sentences from the surrounding sentence context.

The main goal of this study is to leverage topic-level semantics to help with the summarization task. Thus, our training is built on top of the "sshleifer/distilbart-cnn-12-6" checkpoint released by HuggingFace for the BART model (https://huggingface.co/sshleifer/distilbart-cnn-12-6). TAAS uses an architecture similar to BART: a 12-layer encoder with bi-directional Transformer layers and a 12-layer decoder with uni-directional Transformer layers. We add one topic-aware attention layer between the encoder and the decoder to incorporate the semantic structure.

During our experiments, we set λ in Eq. 14 to 0 to focus only on the loss from the sequence modeling part, while the parameters of the NTM are fixed. Note that λ could be tuned to balance the losses from the NTM and the sequence modeling, which we leave for future research. Due to the limitation of our computing capacity, we calculate the topic-word distribution using the samples within each batch. We use the Adam optimizer (note that the β₁ and β₂ referred to here are hyperparameters of the Adam optimizer and are not related to the topic-word distribution β), and we apply dropout across all layers.

5.2. Experiment Results

Performance comparison. Table 2 shows the performance comparison between our TAAS model and the aforementioned benchmark models on the CNN/DM dataset. We place ProphetNet and PEGASUS in the industry SOTA category given their top positions on the CNN/DM leaderboard. Note that we cannot reproduce their reported performance using the hyperparameter settings from the original papers given our limited computing capability, and we therefore report their scores using the same hyperparameter settings as our TAAS model. Table 2 reports TAAS performance and its improvement percentage over the second-best performing model (shown in parentheses), excluding the industry SOTA models. We have the following observations. (1) TAAS outperforms the naive rule-based method Lead-3. This is expected: looking only at the first three sentences of an article cannot capture all key information, because different reporters have different writing styles. Some prefer to summarize all important topics at the beginning, while others take an inductive approach and leave the important information until the end; using Lead-3 in the latter case makes the summary inaccurate. (2) TAAS performs better than recent state-of-the-art academic models (both extractive and abstractive). Since our model uses an architecture similar to BART, we make a particular comparison against BART and find that TAAS improves performance by 1.9%, 7.9%, and 12% regarding the F1 measure of ROUGE-1, ROUGE-2, and ROUGE-L, respectively. This confirms that incorporating global semantic structures, such as the latent topics of text, indeed generates better summaries. (3) The improvements of all models over the simple rule-based Lead-3 model in terms of F1 scores are not dramatically large. This may be due to the writing style of news articles, in which the first few sentences tend to summarize the whole article. More experiments should be conducted on other summarization datasets to study the effect of our model on other types of documents.


Figure 4. A qualitative evaluation of TAAS and BART on attention and summary

Truth: Bob Barker returned to host "The Price Is Right" on Wednesday. Barker, 91, had retired as host in 2007.

Generated summaries:

T5: "The Price Is Right" returned to hosting for the first time in eight years. Despite being away from the show for most of the past 8 years, a television legend didn't seem to miss a beat. Bob Barker hosted the game show for 35 years before stepping down in 2007.

BART: Bob Barker hosted the TV game show for 35 years before stepping down in 2007. Barker handled the first price-guessing game of the show before turning hosting duties over to Drew Carey. Despite being away from the show for most of the past eight years, Barker didn't seem to miss a beat.

ProphetNet: A TV legend returned to doing what he does best. Contestants told to "come on down!" on the April 1 edition of "The Price Is Right" encountered not host Drew Xarey.

PEGASUS: Barker hosted "The Price Is Right" for 35 years. He stepped down in 2007.

TAAS: Bob Barker returns to hosting "The Price Is Right" for the first time in eight years. The 91-year-old TV legend stepped down from the show in 2007 after 35 years on the show, which he hosted for 35 years.

Table 3. Example summaries generated by various models

Qualitative evaluation. We conduct a qualitative evaluation to further demonstrate that: 1) word attention weights are indeed changed by our topic-aware attention mechanism, and 2) our TAAS model can generate more novel sentences that are more related to the document topics. As shown in Figure 4, we highlight (in yellow) the top five phrases with the highest attention values under the topic-aware attention in TAAS and the self-attention in BART to illustrate the word-level attention differences. We also provide the generated summaries for comparison. The phrases emphasized by self-attention concentrate on the leading part of the article and fail to capture the important information in the text; the generated result is also less meaningful compared to that generated by TAAS. Although the summaries from both approaches cover the keywords of the news, TAAS generates clearer and more coherent results, and in particular it summarizes the key ideas correctly. Since both models target abstractive summarization, we further assess the extent to which the models are able to rewrite when generating an abstractive summary. The result from BART is less informative and reuses several original sentences from the news, e.g., the whole sentence highlighted in purple in Figure 4. TAAS not only captures the key information of the article well but also demonstrates novel sentence structure. We also observe some duplicated information in the summary generated by TAAS. For example, "after 35 years on the show" conveys the same information as "which he hosted for 35 years," which indicates room for improvement in our future research. As an additional comparison, we also present the summaries of the same article generated by the other baseline models in Table 3.

6. Discussion

6.1. Effect of Different Number of Topics

Figure 5. The impact of the number of topics on model performance

TAAS achieves relatively superior performance over several baselines, demonstrating the effectiveness of incorporating topic-aware attention into summarization. Like many deep learning models, TAAS involves many hyperparameters to which performance might be sensitive; among them, the number of topics K is a critical one that deserves more exploration. To study it, we vary K from 5 to 100 and report the F1 score of ROUGE-L, following prior literature in reporting only ROUGE-L (gu2020generating). Note that every reported score is calculated on the test set. The result is shown in Figure 5, from which we have the following observations. (1) TAAS achieves the best performance at a moderate number of topics. This is reasonable given that each batch during training has 32 articles, so the number of latent topics should be neither too many nor too few. (2) The ROUGE-L F1 scores for different values of K do not vary much, which might be due to several reasons: (i) the weight λ balancing the NTM and the encoder-decoder sequence modeling is set to 0, as we focus on summarization; (ii) each generated summary includes top-attended words that are likely to be similar even when K varies, given the nature of the dataset, where the 32 articles in a batch are unlikely to exhibit diverse topics.

Figure 6. Performance comparison (F1 score of ROUGE) of TAAS and BART on different lengths of articles

6.2. Effect of Different Document Lengths

To test whether TAAS performs equally well on articles of different lengths, we separate the entire CNN/DM dataset into three subsets based on the number of sentences, denoted as CNN/DM-short (< 19 sentences), CNN/DM-medium (19 – 30 sentences), and CNN/DM-long (> 30 sentences), with 4021, 3841, and 3628 articles, respectively. As shown in Figure 6, TAAS achieves better performance for longer articles, with up to a 13.5% improvement in ROUGE-L F1 over BART, compared to 12.6% and 11.6% for short and medium articles. Longer articles are likely to have more diverse topics, which makes the topic attention in the summarization model more salient. This further indicates that adding topic-level information can improve the model on the summarization task.
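For reference, a simple way to reproduce such a length-based split is sketched below; the sentence tokenizer (NLTK) is an assumption, and the thresholds mirror those stated above.

```python
# Hedged sketch of the length-based split; requires nltk.download("punkt") once.
import nltk

def length_bucket(article: str) -> str:
    n_sent = len(nltk.sent_tokenize(article))
    if n_sent < 19:
        return "CNN/DM-short"
    if n_sent <= 30:
        return "CNN/DM-medium"
    return "CNN/DM-long"

print(length_bucket("First sentence. Second sentence. Third sentence."))
```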

7. Conclusion

In this work, we study the abstractive text summarization problem by proposing a topic-aware attention model that incorporates the global semantic structure of text. In particular, we combine neural topic modeling and an encoder-decoder sequence-to-sequence model via topic attention for summarization. We conduct extensive experiments on a real-world dataset to compare our proposed approach with several cutting-edge methods. The results demonstrate superior performance over some well-recognized academic models and comparable performance to the industry SOTA. We also shed light on how model performance is affected by important hyperparameters and the characteristics of the textual data. Although our current framework incorporates the topic-word distribution, the other important output of the NTM, the document-topic distribution, is currently neglected and may be worth exploring in future work. Furthermore, training our model with more powerful computing resources to improve summarization performance and to test the robustness of the model is an interesting direction to pursue.

References