Modern Natural Language Processing (NLP) methods based on deep neural networks have achieved remarkable performance in several different tasks [Vaswani2017AttentionNeed, Radford2019LanguageLearners, Devlin2018Bert:Understanding, liu2019roberta, Lewis2019BART:Comprehension, Raffel2020ExploringTransformer]. Such performance levels are usually achieved by scaling up deep neural networks to millions or even billions of parameters. Scaling, in turn, requires extremely high computational capacity and large training datasets.
Abstractive text summarization is an NLP task that has drawn extensive attention in recent deep learning work, with some very significant achievements [Lewis2019BART:Comprehension, Zhang2020PEGASUS:Summarization, zaheer2020big]. State-of-the-art summarization models generally depend on large supervised datasets to achieve good performance and generalize to unseen data. Summarization datasets are usually document collections, each accompanied by some form of summary, typically written by humans.
Although numerous such datasets are available in the literature [Napoles2012AnnotatedGigaword, Hermann2015TeachingComprehend, Narayan2018DontSummarization, Grusky2018Newsroom:Strategies], this is not the case in many practical applications, since constructing large supervised datasets is very expensive and time consuming. Collecting good quality training data in large amounts and annotating them for summarization would be costly for many small businesses trying to adopt summarization technology to solve problems in their respective domains. This cost can be particularly high if domain expertise is required for annotation, which is true for many use cases such as financial and legal documents.
Active learning methods have been widely adopted in an effort to reduce deep learning data requirements for various tasks [Houlsby2011BayesianLearning, Gal2017DeepData]. Strategically selecting for annotation the most informative samples has proven to be more effective than random selection when the budget for annotating data is small [Siddhant2020DeepStudy]. Active learning has also been applied to NLP problems but it has rarely been explored from a summarization perspective [Zhang2009ActiveSummarization].
We present Bayesian Active Summarization (BAS), an approach for applying active learning to abstractive text summarization aiming to mitigate the data dependence of summarization models. BAS iteratively alternates between annotating and training, in an effort to maximize the gains from a limited data annotation budget. Based on previous work [Gidiotis2021Uncertainty-AwareSummarization], we apply Bayesian deep learning methods with Monte Carlo (MC) dropout [Gal2016DropoutLearning] to quantify summarization uncertainty, and use it to select data samples for annotation.
We empirically show that BAS is more data efficient than random selection, and achieves better overall performance when the annotation budget is small. More specifically, we conducted experiments using the PEGASUS summarization model [Zhang2020PEGASUS:Summarization] and managed to reach 95% of the performance of a PEGASUS model trained on the full XSum dataset [Narayan2018DontSummarization] using less than 150 training samples.
Finally, we analyze BAS with regard to computational cost, in an effort to identify the effects different design choices have. This analysis gives us insights into the trade-off between performance and computational complexity, helping us understand how to scale BAS effectively.
The rest of this paper is structured as follows. Section 2 is a review of relevant work in active learning, Bayesian methods and their application to NLP. In Section 3 we briefly introduce Bayesian uncertainty and its extension to summarization models. Then, in Section 4 we present in detail the main BAS algorithm, and in Section 5 some practical considerations concerning scalability and robustness. In Section 6 we describe our experimental setup, and in Section 7 we discuss our main findings, including an analysis of BAS from different angles. We conclude with some final remarks in Section 8.
2 Related work
A number of works have applied active learning methods to NLP problems. [Shen2018DeepRecognition] and [Siddhant2020DeepStudy] empirically study multiple different active learning methods, focusing on sentence classification and NER. Various works combine BERT [Devlin2018Bert:Understanding] with different active learning methods on NLP tasks like text classification [Ein-Dor2020ActiveStudy], NER [Liu2020LTP:Recognition] and natural language understanding (NLU) [Griehaber2020Fine-tuningLearning].
Bayesian active learning methods have been successfully applied to various problems, ranging from computer vision [Kendall2017WhatVision, Litjens2017AAnalysis, Gal2017DeepData] to NLU [Griehaber2020Fine-tuningLearning] and NER [Siddhant2020DeepStudy]. Most of the aforementioned methods use BALD [Houlsby2011BayesianLearning] variations and MC dropout [Gal2016DropoutLearning] in order to acquire samples for annotation.
Closely related to our work, [Lyu2020YouAnswering] uses the BLEUVar disagreement [Xiao2020WatTransformers] between samples generated with MC dropout as an uncertainty metric for selecting samples to annotate. They mainly focus on the question answering task, and show that performance can be improved by training on a subset of the most uncertain samples.
In stark contrast, active learning methods have seen limited adoption in summarization. [Zhang2009ActiveSummarization] proposes an active learning method that utilizes additional resources, and selects documents based on their similarity with PowerPoint slides from corresponding presentations. They then annotate the selected documents and use them to train an extractive summarization model.
Another interesting work by [Schick2020Few-ShotTraining] tries to reduce summarization data requirements without the use of active learning. They propose augmenting the inputs to the PEGASUS model [Zhang2020PEGASUS:Summarization] in order to make fine-tuning more effective. They achieve considerable improvements in few-shot summarization performance on multiple benchmark datasets. Nevertheless, their work assumes that only a few, between 10 and 100, annotated samples are available while we focus on actively selecting samples for annotation. Consequently, our methods are not directly comparable to their approach.
Finally, our previous work [Gidiotis2021Uncertainty-AwareSummarization] suggests using MC dropout to estimate summarization models' uncertainty. Similar to [Lyu2020YouAnswering], multiple stochastic summaries are generated, and the BLEUVar disagreement between those summaries is computed as an uncertainty estimate. Inspired by that work, we apply summarization uncertainty in the context of active learning, acquiring highly uncertain documents at each annotation round.
3 Bayesian uncertainty
In this section we briefly introduce some key concepts about Bayesian uncertainty, which is the foundation of our document selection strategy.
3.1 Monte Carlo dropout
Standard deep learning models are notorious for making highly confident predictions, even for inputs lying far away from their learned distribution [Gal2016DropoutLearning, Xiao2020WatTransformers]. In contrast, Bayesian probabilistic models can explicitly model uncertainty through the variance of their predictive distribution. Finding the model's predictive distribution involves deriving the entire posterior distribution over all model parameters given the training data (Equation 1), and making predictions at test time by integrating over all possible parameter values (Equation 2).
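As a reference point, the standard Bayesian formulation behind Equations 1 and 2, writing $\theta$ for the model parameters and $X, Y$ for the training inputs and targets (the exact notation of the original equations is our assumption), is:

```latex
% Posterior over parameters given the training data (Equation 1)
p(\theta \mid X, Y) = \frac{p(Y \mid X, \theta)\, p(\theta)}{p(Y \mid X)} \tag{1}

% Predictive distribution for a new input x*, integrating over all parameter values (Equation 2)
p(y^{*} \mid x^{*}, X, Y) = \int p(y^{*} \mid x^{*}, \theta)\, p(\theta \mid X, Y)\, d\theta \tag{2}
```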
Although integrating over all possible parameter values is intractable for deep neural networks, alternative methods can approximate this integral during inference. Monte Carlo dropout [Gal2016DropoutLearning] is one such method: it performs multiple stochastic forward passes with dropout [Srivastava2014Dropout:Overfitting] turned on at test time, which is equivalent to drawing samples from the model's predictive distribution. It is a practical way to approximate Bayesian inference, readily applicable to a wide range of deep learning models.
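The idea can be illustrated with a toy, library-free sketch: a hypothetical one-layer "network" whose hidden units are randomly dropped on every forward pass, so repeated passes yield a distribution of predictions. Real implementations would instead re-enable the dropout layers of a trained model at inference time.

```python
import random
import statistics

def stochastic_forward(x, weights, p_drop=0.1, rng=random):
    # One forward pass with a fresh dropout mask over the hidden units.
    hidden = []
    for w in weights:
        if rng.random() >= p_drop:                 # unit kept
            hidden.append(w * x / (1.0 - p_drop))  # inverted dropout scaling
        else:                                      # unit dropped
            hidden.append(0.0)
    return sum(hidden) / len(hidden)

def mc_dropout_predict(x, weights, n_samples=100, seed=0):
    # N stochastic forward passes approximate draws from the predictive distribution.
    rng = random.Random(seed)
    preds = [stochastic_forward(x, weights, rng=rng) for _ in range(n_samples)]
    return statistics.mean(preds), statistics.variance(preds)

mean, var = mc_dropout_predict(2.0, [0.5, -0.2, 0.8], n_samples=100)
# The spread (variance) across stochastic passes is the uncertainty signal.
```

The same principle carries over to sequence models: each stochastic pass produces a different summary, and the disagreement between summaries plays the role of the variance above.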
3.2 Application to summarization
When applied to text summarization [Gidiotis2021Uncertainty-AwareSummarization], MC dropout can be used to estimate uncertainty. The uncertainty estimation process for summarization models involves two steps. First, stochastic summaries are sampled for a given input by running forward passes with different dropout masks. Then, the disagreement between summaries, measured by BLEU Variance (BLEUVar) [Xiao2020WatTransformers] between all summary pairs, can be used to estimate summarization uncertainty. In practice, BLEUVar is computed by summing the squared complement of BLEU [Papineni2002BLEU:Translation] among all summary pairs generated with MC dropout as shown in Equation 3.
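The pairwise computation can be sketched as follows. Note that `pair_bleu` below is a deliberately simplified stand-in (clipped unigram precision only) rather than a full BLEU implementation, and the normalization by N(N-1) is our assumption; the paper's Equation 3 may normalize differently.

```python
from itertools import permutations

def pair_bleu(hyp, ref):
    # Stand-in for BLEU: clipped unigram precision (real BLEU uses up to 4-grams
    # plus a brevity penalty).
    hyp_tokens, ref_tokens = hyp.split(), ref.split()
    if not hyp_tokens:
        return 0.0
    ref_counts = {}
    for t in ref_tokens:
        ref_counts[t] = ref_counts.get(t, 0) + 1
    matches = 0
    for t in hyp_tokens:
        if ref_counts.get(t, 0) > 0:
            matches += 1
            ref_counts[t] -= 1
    return matches / len(hyp_tokens)

def bleuvar(summaries):
    # Sum of squared BLEU complements over all ordered summary pairs,
    # normalized by the number of pairs N(N-1).
    n = len(summaries)
    total = sum((1.0 - pair_bleu(a, b)) ** 2 for a, b in permutations(summaries, 2))
    return total / (n * (n - 1))

identical = ["the cat sat"] * 4
diverse = ["the cat sat", "dogs run fast", "a bird flew by", "rain fell today"]
# Higher disagreement between sampled summaries means higher uncertainty.
assert bleuvar(identical) < bleuvar(diverse)
```

Swapping `pair_bleu` for a proper BLEU implementation recovers the actual metric while keeping the same pairwise structure.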
4 Active summarization
The main objective of Bayesian active summarization (BAS) is to train a summarization model that achieves competitive performance, but requires less supervised data for training. Since creating large numbers of samples for summarization training can be particularly difficult and costly, we focus on training budgets of only a few hundreds of annotated samples.
Active learning methods, and particularly Bayesian ones, are known to have significant computational overheads. When applied to the abstractive summarization task, which also has high resource requirements, such methods can lead to very high computational costs. A secondary objective of this work is to develop a practical and effective method that is also resource efficient.
For BAS we follow the well-established active learning paradigm, which strategically selects data to annotate over alternating rounds of labeling and training [Cohn1996ActiveModels, Houlsby2011BayesianLearning, Siddhant2020DeepStudy]. The full active summarization algorithm is shown in Algorithm 1. Table 1 collects the notation used in this section.
We start off with a learner, which is a standard abstractive summarization model. Initially, we have a pool of unlabeled documents and an empty set of labeled documents. Since our goal is to achieve strong performance with as few annotated documents as possible, it makes sense for the initial learner to be some pre-trained model, such as BART [Lewis2019BART:Comprehension] or PEGASUS [Zhang2020PEGASUS:Summarization], but in principle any neural summarization model could work.
During training we also need a separate annotated set, which will be used for validation and early stopping. Following what was suggested in [Siddhant2020DeepStudy], we keep a separate validation set of labeled documents, which is proportional to our total annotation budget. It would be unrealistic to have a few hundred examples for training but thousands of annotated examples for validation.
First, we warm-start our learner by training on a small sample of randomly selected and annotated (summarized) documents from the unlabeled pool. This part is important because it essentially "introduces" the summarization task to the learner. In practice, a pre-trained Transformer based model requires a few dozen examples for this initial learning stage. The initial training documents are also added to the labeled set and removed from the unlabeled pool.
The rest of the learning iteratively alternates between labeling (summarizing) and training, where each full iteration is called a learning step. The process is repeated for multiple steps, until we reach satisfactory performance or we exhaust our annotation budget.
The labeling phase in each learning step is split into two parts. First, the learner generates stochastic summaries with MC dropout for each document in the unlabeled pool, and based on these summaries we compute the BLEUVar uncertainty score for each document. Then, the documents with the highest uncertainty are selected and target summaries are retrieved for them.
After the labeling phase, we proceed to the training phase. The documents annotated in the previous phase are added to the labeled set, and the learner is trained on the whole labeled set. At each learning step, we train the learner from scratch in order to avoid overfitting the data samples collected in earlier steps [Hu2019ActiveFeedback]. We also make sure to train the learner for a sufficient number of epochs, using early stopping based on performance on the validation set, in order to avoid underfitted learners [Mukhoti2018OnLearning].
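The alternating labeling/training loop can be sketched as below. Here `annotate`, `train_from_scratch` and `uncertainty` are hypothetical stand-ins for human annotation, supervised fine-tuning, and the MC dropout scoring of Section 3, and the toy run uses integers as "documents".

```python
import random

def bas_loop(unlabeled, budget, warm_start, select_k,
             train_from_scratch, annotate, uncertainty):
    """Sketch of Algorithm 1: alternate labeling and training until the budget is spent."""
    labeled = []
    # Warm-start: random documents "introduce" the task to the learner.
    for doc in random.sample(unlabeled, warm_start):
        unlabeled.remove(doc)
        labeled.append(annotate(doc))
    model = train_from_scratch(labeled)
    while len(labeled) < budget:
        # Labeling phase: score every pooled document, take the most uncertain.
        ranked = sorted(unlabeled, key=lambda d: uncertainty(model, d), reverse=True)
        for doc in ranked[:select_k]:
            unlabeled.remove(doc)
            labeled.append(annotate(doc))
        # Training phase: retrain from scratch on everything labeled so far.
        model = train_from_scratch(labeled)
    return model, labeled

docs = list(range(100))
model, labeled = bas_loop(
    docs, budget=40, warm_start=20, select_k=10,
    train_from_scratch=lambda data: len(data),   # dummy "model"
    annotate=lambda d: (d, f"summary-{d}"),      # dummy target summary
    uncertainty=lambda m, d: d % 7,              # dummy uncertainty score
)
```

Note that this version scores the entire pool at every step; Section 5.1 replaces that with a random sub-sample precisely because this scoring pass dominates the cost.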
Table 1. Notation used in this section:
- number of samples in the labeled set
- number of samples in the unlabeled set
- number of validation samples
- sampled set to be ranked by uncertainty
- number of samples to be ranked
- set of documents selected based on high uncertainty
- number of documents selected in each step
- number of MC dropout samples
- number of documents for "warm-start" training
- pre-trained summarization model
- summarization model after "warm-start" training
- summarization model trained at each learning step
5 Practical considerations
In this section we discuss practical considerations that arise when applying active summarization in real-world scenarios. These considerations involve important decisions that must be made with each application's specific requirements in mind, and have a great impact on BAS's practical performance, cost, scalability and robustness.
5.1 Trading off precision for speed
Active learning methods usually trade off data collection and annotation costs for increased computational costs. Although collecting and annotating data is usually the most costly and time consuming part, very high computational costs are still sub-optimal.
Let's assume that we have a training budget of b samples, that the cost of generating a single summary is constant, and that the cost of computing BLEU for a pair of texts is also constant. Equation 4 gives us the cost of ranking and selecting the most uncertain samples out of the full unlabeled set of documents.
If we also assume that the cost of each training step is constant, as is the cost of the initial warm-start training, the total cost of BAS is given by Equation 5.
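A plausible reconstruction of the form of Equations 4 and 5, under assumed notation (per-summary generation cost $c_g$, per-pair BLEU cost $c_b$, $N$ MC dropout samples, unlabeled pool $U$, $k$ documents selected per step, per-step training cost $c_t$, warm-start cost $c_w$ over $w$ documents, budget $b$), is:

```latex
% Cost of one ranking pass over the full unlabeled pool: N generations per document,
% plus BLEUVar over all N-choose-2 summary pairs (Equation 4).
C_{\mathrm{rank}} = |U| \left( N\,c_g + \binom{N}{2} c_b \right) \tag{4}

% Total BAS cost: warm-start, then one ranking pass and one retraining
% per learning step, for (b - w)/k steps (Equation 5).
C_{\mathrm{BAS}} = c_w + \frac{b - w}{k}\left( C_{\mathrm{rank}} + c_t \right) \tag{5}
```

The exact symbols and constant factors here are our assumption; the structural point, that the ranking term scales with $|U|$, is what the following discussion relies on.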
We can see that the BAS cost is a function of the number of documents to rank, the number of MC dropout samples and the number of samples selected per step. Since the latter two are generally expected to be much smaller than the size of the entire unlabeled set, and the total budget is predefined, the total BAS cost depends mostly on the number of documents that must be ranked.
In many practical applications we expect the unlabeled set to be rather large. This is a common pattern in all active learning methods, but in the case of BAS we need to take into account that modern abstractive summarization models have a high computational cost and we need to perform multiple forward passes for each input sample. With these considerations in mind we can see that the computational complexity of Algorithm 1 can become prohibitively high.
To address this, we relax our requirement: instead of trying to find the documents with the highest uncertainty in the entire unlabeled pool, we randomly sample a smaller set of documents from the pool. Then, in the labeling phase, we use MC dropout to generate summaries and estimate uncertainty only for the documents in this sample. Once we select the top documents from the sample, the remaining documents that were not selected are returned to the unlabeled pool and can be selected again at a later learning step. After this modification, the cost of BAS no longer depends on the size of the unlabeled pool, as can be seen in Equation 6. The adjusted BAS algorithm is shown in Algorithm 2.
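The adjusted labeling phase can be sketched as follows; `uncertainty` is assumed to wrap the MC dropout scoring, and all names are illustrative.

```python
import random

def sample_and_select(unlabeled, uncertainty, sample_size, select_k, rng=random):
    """Labeling phase of Algorithm 2: rank a random subset instead of the whole pool."""
    pool_sample = rng.sample(unlabeled, min(sample_size, len(unlabeled)))
    # Score only the sampled documents: the ranking cost now scales with
    # sample_size rather than the full pool size.
    ranked = sorted(pool_sample, key=uncertainty, reverse=True)
    picked = ranked[:select_k]
    for doc in picked:
        unlabeled.remove(doc)   # selected documents leave the pool for annotation
    return picked               # unselected sampled documents stay in the pool

docs = list(range(1000))
picked = sample_and_select(docs, uncertainty=lambda d: d % 13,
                           sample_size=100, select_k=5)
```

Because unselected documents are returned to the pool, a document passed over in one step can still be sampled and selected later.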
To further illustrate the importance of the sampling step, we refer the reader to Table 2, which shows an experimental study over different sample sizes. We report the average computational time, decomposed into scoring time and training time. It is clear that increasing the sample size leads to a steep increase in scoring time, which takes up a significant proportion of the total time and outweighs the training time. One can imagine that in the extreme case, where the sample is the entire unlabeled set, the computational time for each step would be very high even for medium-sized datasets.
With the introduction of this sampling step, we effectively stop worrying about finding the most uncertain samples in the whole pool, and instead look for samples of sufficiently high uncertainty. By limiting the number of documents participating in the labeling phase, we significantly reduce each step's computational complexity. The sample size, i.e. the number of documents we sample at each step, controls the trade-off between precision and speed. High values guarantee that the selected documents will have very high uncertainty, but also mean we have to process a large number of documents during each step's labeling phase. On the other hand, low values mean only a few documents are processed at each step, but lead to the selected documents having lower uncertainty on average.
One additional benefit that comes with randomly selecting samples to rank, is that it allows us to sidestep a common active learning issue. Selecting the highest uncertainty samples from the entire dataset can easily result in acquiring a batch of very similar examples at each learning step. This problem is known to hurt performance and generalization for active learning methods [Kirsch2019BatchBALD:Learning].
In Section 7.1 we dive deeper into this trade-off between efficiency and performance, and empirically explore the effects of different sample sizes.
5.2 Dealing with noisy data
Another practical implication that comes with selecting very high uncertainty samples is that we run the risk of them being noisy or of poor quality. As discussed in [Gidiotis2021Uncertainty-AwareSummarization], particularly in noisy datasets, documents with extremely high uncertainty could be problematic, and summarization performance could be improved if we eliminate them from the training set. Examples of problematic documents include ones that are malformed, written in a different language, or completely missing meaningful content.
Although the sampling strategy discussed in Section 5.1 helps with the problem of noisy samples, it does not address it completely. Such samples can be thought of as outliers in the uncertainty distribution, and as such they should be removed from the training data. Especially when the data budget is low, keeping noisy samples can be detrimental to overall performance, because they may end up accounting for a significant proportion of the training dataset.
We introduce a heuristic method that uses a simple uncertainty threshold to remove documents of disproportionately high uncertainty. This heuristic enables us to filter out most noisy documents and ensure the selected documents are of relatively good quality.
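In code, the heuristic amounts to one extra filter over the scored documents before ranking. The cutoff of 0.96 used in our experiments (Section 6.3) is dataset-specific and shown here only for illustration.

```python
def drop_outliers(scored_docs, max_uncertainty=0.96):
    """Discard documents whose BLEUVar score exceeds the outlier threshold.

    scored_docs: list of (document, bleuvar_score) pairs.
    """
    return [(doc, u) for doc, u in scored_docs if u <= max_uncertainty]

scored = [("ok-doc", 0.41), ("hard-doc", 0.88), ("garbage-doc", 0.99)]
filtered = drop_outliers(scored)  # the 0.99 outlier is removed
```

Documents above the threshold are treated as likely noise (malformed, wrong language, or content-free) rather than as genuinely informative hard examples.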
6 Experimental setup
In order to assess BAS’s effectiveness, an experimental study was conducted. Here, we describe our experimental setup including the data and model used for the study as well as the learning details. We aim to simulate a real-world scenario with a low data annotation budget and as a consequence all experimental decisions are made under that assumption.
XSum [Narayan2018DontSummarization] is a dataset of 227k news articles from BBC covering a wide variety of topics. Each article is paired with a human written, single-sentence summary. We used the XSum version openly available in the Hugging Face datasets repository (https://huggingface.co/datasets).
Since our emphasis is on small data budgets, we set the total data annotation budget to 900 samples and use the whole 200k-article training set as the unlabeled pool. Each time a document is selected, we retrieve its target summary and add it to the labeled set. Out of the total data budget, 800 samples are used for training and the remaining 100 samples are left for the validation set, to be used for early stopping.
At the end of each learning step, the resulting model's performance is evaluated on the full XSum test set of 11k articles. This evaluation step is not part of the active summarization algorithm described in Section 4, and diverges from the low resource setup. We purposely made this decision in order to facilitate comparisons with existing methods on the XSum dataset and to verify the overall effectiveness of BAS.
PEGASUS [Zhang2020PEGASUS:Summarization] is a Transformer based sequence-to-sequence summarization model achieving very strong performance on multiple well established summarization benchmarks. The model's encoder and decoder consist of 16 Transformer blocks each. PEGASUS is pre-trained on the C4 and HugeNews datasets with a sentence infilling task. We used the open-sourced, pre-trained PEGASUS version found in the Hugging Face models repository (https://huggingface.co/models).
In order to perform Bayesian inference with PEGASUS, we follow the same approach used in [Gidiotis2021Uncertainty-AwareSummarization] by enabling dropout during inference for all Transformer blocks.
6.3 Learning details
Here we present the learning details and parameters used when applying BAS in our experiments. We start off with the openly available, pre-trained PEGASUS model and initially train it on a small set of randomly sampled and annotated documents. Then we run multiple learning steps as described in Section 4 until we exhaust our 800-sample training budget. At the end of each round, we evaluate the performance of the resulting model on the XSum test set. Each experiment was repeated three times with different random seeds in an effort to alleviate the effects of randomness, and all metrics were averaged across the three runs.
In the labeling phase, we use MC dropout, as in [Gidiotis2021Uncertainty-AwareSummarization], to compute the uncertainty of the model trained in the previous learning step. When generating summaries, we use standard beam search with a beam size of 3, which is a good trade-off between performance and speed. In our main experiments we tried different values for the sample size. Also, when selecting samples for annotation, we ignore all documents with uncertainty higher than 0.96, as explained in Section 5.2. This heuristic threshold was determined after experimentation.
In the training phase, we start each time from the pre-trained PEGASUS model and train on the whole labeled set retrieved so far. In each step, the model is trained for at most 10 epochs, using early stopping with a patience of 4 epochs. We avoid extensively fine-tuning hyper-parameters because in a real world scenario with limited data, extensive fine-tuning would result in severe overfitting. For most training hyper-parameters we use the values from the original PEGASUS paper [Zhang2020PEGASUS:Summarization], except for the batch size, which was set to 6 in order to keep resource requirements low. With this hyper-parameter setup we managed to run the entire learning on a single Nvidia T4 GPU.
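The stopping rule above can be sketched as follows; `evaluate` is a hypothetical callback that trains for one epoch and returns the validation loss.

```python
def train_with_early_stopping(evaluate, max_epochs=10, patience=4):
    """Train for at most max_epochs, stopping after `patience` epochs without improvement."""
    best_loss, best_epoch = float("inf"), -1
    for epoch in range(max_epochs):
        val_loss = evaluate(epoch)   # one epoch of training plus validation
        if val_loss < best_loss:
            best_loss, best_epoch = val_loss, epoch
        elif epoch - best_epoch >= patience:
            break                    # no improvement for `patience` epochs
    return best_epoch, best_loss

# Toy run: validation loss improves until epoch 2, then degrades.
losses = [3.0, 2.5, 2.4, 2.6, 2.7, 2.8, 2.9, 3.0, 3.1, 3.2]
best_epoch, best_loss = train_with_early_stopping(lambda e: losses[e])
```

With a validation set of only around a hundred samples, this kind of patience-based stopping is what keeps each retraining from both underfitting and overfitting the small labeled set.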
7 Results

The experimental results presented in this section are organized in the following way. First, we go through the process of tuning BAS, in an attempt to achieve a good balance between effectiveness and computational complexity. Then, we evaluate the performance of BAS and compare it with a baseline that follows the standard supervised learning paradigm of randomly selecting and annotating samples.
7.1 Tuning for performance and scalability
As described in Section 5.1, our method samples documents instead of calculating BLEUVar over the entire unlabeled set, which would be prohibitively expensive in many cases. Sampling allows us to greatly reduce BAS’s computational complexity, since scoring lots of documents with MC dropout is a very costly operation.
The number of sampled documents is a hyper-parameter that can significantly affect the performance and scalability of our method. We assess its effects by running experiments with a range of sample sizes, and compare the outcomes in terms of summarization performance and computational cost.
In general, smaller sample sizes result in faster run times, as shown in Table 2, but the samples selected at each step have lower uncertainty on average. Figure 1 illustrates the distributions of BLEUVar uncertainty scores for different sample sizes. We observe that for higher values, the distribution of the selected documents shifts significantly towards higher uncertainty.
In Figure 2 we show the performance curves for ROUGE-1, ROUGE-2 and ROUGE-L F-score [Lin2004Rouge:Summaries] obtained with different sample sizes. Our first observation is that all curves converge to similar performance in the later learning stages, but exhibit different behavior early on. We can see that increasing the sample size yields better performance early on, when the data budget is smaller, but the gains are not very significant towards the end.
The improvement in the performance curves can be attributed to the higher average uncertainty of the documents selected with larger sample sizes. As the average uncertainty of the selected samples drops, BAS selects samples closer to the full dataset's mean, giving up the advantages of active learning.
Based on our findings in this section, we remark that sample sizes smaller than 100 lead to a drop in performance and should thus be avoided. Also, increasing the sample size beyond 200 does not bring a convincing improvement in performance, but increases computational costs considerably. We conclude that scoring 100-200 samples and selecting the ones with the highest uncertainty offers a good trade-off between performance and efficiency. For the rest of our experiments, we use sample sizes of 100 and 200.
We plot the ROUGE-1, ROUGE-2 and ROUGE-L F-score performance for each learning step in Figures 3 and 4, and compare the curves of Bayesian Active Summarization (BAS) against a baseline that uses random selection at each step. All curves are averaged across three runs.
In Table 3 we show the best performance achieved by the different approaches in terms of ROUGE. For the sake of comparison, we also show the performance achieved by PEGASUS when trained on the entire training set with standard supervised learning (PEGASUS full; re-evaluated for consistency), as well as the performance without any task-specific training (PEGASUS pre-trained).
Table 3 covers two budget regimes: small (800 samples) and very small (150 samples).
Overall, BAS performs better than random selection as can be seen by the performance curves. More specifically, the performance curves of BAS-100 and BAS-200 are higher compared to random selection across all metrics. As the number of annotated samples increases we can see both BAS and random start converging to similar performance although BAS is still slightly higher than random selection. Also, in terms of ROUGE, BAS’s best performance is higher than random, with BAS-100 having a slight edge over BAS-200 on that aspect.
BAS's biggest advantage appears when the annotated dataset is very small, for example fewer than 150 samples, where it clearly outperforms random selection by a considerable margin. Also, for very small data budgets, BAS-200 is better than BAS-100. Based on our experiments, BAS-200 can get close to 95% of the performance of a PEGASUS model trained on the full XSum with fewer than 150 annotated samples. With the same training budget, random selection achieves 93% of that performance, while the pre-trained model covers only 39% of it. Finally, using 5 times more data we can reach almost 97%, which is only a slight improvement considering the significant increase in data annotation cost.
Being able to achieve good performance with such a small data budget is an extremely useful property in many real-world applications, since collecting and annotating even a few hundreds of training samples can be a challenge. Our findings in this experiment suggest that with BAS large summarization models such as PEGASUS could be applied effectively to solve few-shot problems.
Although the percentage difference between BAS and random selection is not particularly large, this is mainly due to the fact that PEGASUS is already a very effective pre-trained model. We argue that since this is a strong model, it is much harder to improve its performance by a large margin. Nevertheless, BAS is still a useful tool that allows us to get more value out of a small training budget.
In Figures 3 and 4, we also plot the standard deviation over the three runs for each individual curve. We observe that BAS exhibits smaller standard deviation across runs than random selection, leading us to the conclusion that BAS is more robust and less affected by stochastic factors. This observation is also visible in Table 3, where we show the standard deviations of the different approaches for all metrics. BAS-200 has significantly lower standard deviation than BAS-100, which suggests that increasing the sample size can lead to more robust solutions.
Robustness is particularly important in practical active learning setups, where getting extremely unlucky when selecting data samples to be annotated could lead to inferior performance. In practice, repeating the data selection process with a different random seed is not really an option due to the fixed budget, so it is crucial to have a method more likely to find a good solution.
8 Conclusions

In this work, we explored active learning in the context of abstractive text summarization. Although active learning methods have had significant impact on various NLP problems, applications to text summarization have been very limited. We introduced BAS, as a way of combining active learning methods with state-of-the-art summarization models. BAS is, to the best of our knowledge, the first attempt to apply Bayesian active learning to abstractive text summarization.
Our main findings suggest that indeed BAS can be an effective way for applying active learning to summarization, outperforming naive random selection in several ways. BAS achieves stronger performance and has better learning curves on small and very small annotation budgets compared to random selection. In addition, it leads to more robust learning as it reduces the risk of being extremely unlucky and selecting bad samples for annotation. At the same time, it allows for identifying and eliminating noisy samples that could hurt performance.
The main advantage of BAS, and active learning methods in general, is the ability to achieve strong performance with very few annotated data samples. As shown in our experiments, we managed to reach 95% of the performance of the fully trained PEGASUS model, using less than 150 training samples. This finding can have a significant impact in many real-world applications where collecting large datasets is very costly.
In addition, we performed an experimental analysis of different BAS setups in an attempt to better understand the effect of different design decisions with regards to performance and computational efficiency. We found that selecting the most uncertain documents from a small sub-sample of the full dataset yields satisfactory performance and scales well.
Although our findings suggest Bayesian Active Learning is a promising approach for improved abstractive summarization, we are barely scratching the surface of this interesting and very little explored topic. We hope our work will spark a broader discussion and will be a starting point for further exploring active learning methods on the task of text summarization.