Training complex neural models usually requires large amounts of data. However, data annotation still poses a challenge for many domains. Thus, much research focuses on data manipulation (e.g., synthesis, augmentation) and additional ways to handle data differently during training. Prior work on the synthesis of textual data has focused on back translation Parida and Motlicek (2019); Wang et al. (2018); Sennrich et al. (2016) and word replacement Wang and Yang (2015); Zhang et al. (2015). We, on the other hand, propose a different approach to data synthesis through paraphrasing. However, synthesis involves data manipulation on the input level, which might expose the model to grammatically or logically incorrect input. Thus, we explore a second approach to data manipulation based on augmentation rather than synthesis. Augmentation aims to move data manipulation from the input side to any part of the model. This, in turn, can help the model be more resilient to over-fitting. Third, we explore using data more efficiently by integrating curriculum learning into the training process. Curriculum learning reorders training samples based on external criteria, which can help train the model gradually and more efficiently without the need for any external data. We also introduce new difficulty metrics based on specificity and abstractiveness for curriculum construction. Finally, we explore combining multiple techniques (synthesis and curriculum) to overcome the data scarcity issue. Thus, our contribution is threefold: 1) We introduce a simple approach for data synthesis through paraphrasing. 2) We use data augmentation by sample mixing to move augmentation into the model. 3) We integrate a curriculum into the training process and introduce two new difficulty metrics.
2 Related Work
Abstractive summarization for low resource data. Prior proposed methods for tackling domains with scarce data have included finetuning pre-trained models Bajaj et al. (2021); Yu et al. (2021); Magooda and Litman (2020) such as BART Lewis et al. (2020) or using few-shot learning Bražinskas et al. (2020); Sarkhel et al. (2020). Our work differs in several aspects. First, our work doesn’t focus on improving a certain summarization model; in contrast, we focus on using data efficiently, which can be applied to various models. Second, we focus on techniques that can improve the training process without additional data, e.g., synthesis, augmentation, and curriculum learning.
Data synthesis and augmentation. Data synthesis for text summarization is underexplored, with only a few approaches such as back-generation Parida and Motlicek (2019) and template-based summary re-writing Magooda and Litman (2020). We propose doing data synthesis by paraphrasing, which is simpler than the back-translation and template methods. While combining synthesis with paraphrasing has been studied in other contexts Wang et al. (2015); Iyyer et al. (2018), our work differs in both goals and techniques. Wang et al. (2015) proposed synthesizing data, then crowdsourcing paraphrases to train semantic parsers, while Iyyer et al. (2018) synthesized data to train a paraphrasing model. Our work, to our knowledge, is the first to use a strong language model finetuned for paraphrasing to synthesize data for text summarization. Finally, for data augmentation, we base our work on the MixText approach Chen et al. (2020). While the original MixText model is used for classification-based tasks, we introduce a variation for generative tasks (called MixGen) and use it for abstractive summarization.
Curriculum learning. Curriculum learning aims to improve the training procedure with the same amount of data. It has been applied in NLP Sachan and Xing (2016, 2018); Tay et al. (2019); Xu et al. (2020); Wang et al. (2020) for machine comprehension, question generation, reading comprehension, NLU and machine translation, respectively. We build on the approach introduced in Xu et al. (2020); however, the core differences are both the downstream tasks (classification versus abstractive summarization) and the difficulty metrics. The only other summarization work that we know of, Kano et al. (2021), focuses on large datasets, while we focus on low resource domains. We also introduce two different difficulty metrics (ROUGE and specificity).
3 Summarization Datasets
CourseMirror (CM) is a student reflections dataset (https://petal-cs-pitt.github.io/data.html) that has been used in prior work to study extractive Luo and Litman (2015) and abstractive Magooda and Litman (2020) summarization. The dataset consists of documents and summaries from four courses.
|Data|# docs|# refs|Train|Val|Test|
Table 1 summarizes the dataset in terms of the number of documents (# docs) and average reflections per document (# refs). We compiled all courses into one dataset (named CM ALL), then split the documents into training, validation, and test sets (80%, 10%, 10%, respectively) by sampling equally from all courses.
Amazon/Yelp (A/Y) is another small dataset, now consisting of opinions (refer to the appendix for examples) Bražinskas et al. (2020) (https://github.com/abrazinskas/FewSum). The dataset contains customer reviews from Amazon He and McAuley (2016) and Yelp. The data contains 160 products/businesses split into training, validation and test sets as shown in Table 1. Each of the products/businesses contains a set of 8 reviews.
4 Proposed Model
4.1 Synthesis via paraphrasing with GPT-2
Influenced by work in style transfer Krishna et al. (2020), we propose synthesizing new summaries by paraphrasing the original human summary, which yields other potential summaries for the same input. We use the paraphraser trained by Krishna et al. (2020). They finetuned a large GPT-2 language model with data from PARANMT-50M Wieting and Gimpel (2018) to direct the model toward generating diverse paraphrases, which they later used for style transfer.
4.2 Augmentation with sample mixing
MixText is a data augmentation approach based on mixing two input samples by taking a weighted sum of the features corresponding to the two samples at any level of the model (a specific layer of the encoder, after the encoder, etc.) using

$$h = \alpha\, h^{1} + (1 - \alpha)\, h^{2},$$

where $h^{1}$ and $h^{2}$ are the hidden states of the two samples at the mixing level and $\alpha \in [0, 1]$ is the mixing weight. The model is then expected to produce a probability distribution over the available classes, similar to a weighted sum of the two samples' gold predictions. We train the model using KL divergence between the predicted distribution and the expected one. We adapt the approach of Chen et al. (2020) for text generation tasks by modifying the decoding process and loss calculation; we call our approach MixGEN. Like the original MixText, we use two input samples and pass them to the encoder. We pass the samples up to a specific layer, then the two hidden states are summed together, weighted by the parameter $\alpha$. On the decoder side, first, we construct the expected values using the following:

$$P_t(v) = \begin{cases} \alpha\, \mathbb{1}[v = y^{1}_t] + (1 - \alpha)\, \mathbb{1}[v = y^{2}_t], & t \le \min(|y^{1}|, |y^{2}|) \\ \mathbb{1}[v = y^{\text{long}}_t], & \text{otherwise} \end{cases} \qquad v \in \{1, \dots, V\},$$

where $V$ = vocab size, and $y^{1}$, $y^{2}$ are the human summaries of input sample 1 and input sample 2, respectively, with $y^{\text{long}}$ denoting the longer of the two. $P_t$ is the probability distribution expected for token $t$. $y^{1}_t$, $y^{2}_t$ are the $t$-th tokens of the first sample's and second sample's human summaries, respectively. In simple wording, during decoding, we expect the output probability distribution across the vocabulary of the decoder at token position $t$ to have two high values: one with value $\alpha$ at the vocabulary token corresponding to the $t$-th token of the first sample's human summary, and another with value $1 - \alpha$ at the vocabulary token corresponding to the $t$-th token of the second sample's human summary. This continues as long as $t$ is less than or equal to both human summaries' lengths. Once $t$ is greater than the shorter summary's length, the expected distribution corresponds only to the longer summary.

Finally, text generation on the decoder side is auto-regressive; thus, the expected token is passed at the end of each generation step. However, if we pass the argmax of the expected distribution, we will always end up passing the token corresponding to the sample with the higher weight ($\alpha$ vs. $1 - \alpha$). Thus, we randomly sample from the two input samples based on their weights using the following equation:

$$\hat{y}_t = \begin{cases} y^{1}_t, & t \le L \text{ and } p \le \alpha \\ y^{2}_t, & t \le L \text{ and } p > \alpha \\ y^{\text{long}}_t, & t > L \end{cases} \qquad p \sim U(0, 1),$$

where $y^{1}$ and $y^{2}$ are the human summaries of the first and second sample, respectively, $L$ is the minimum length of $y^{1}$ and $y^{2}$, $p$ is the sampling probability, and $U$ is a uniform distribution.
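The target construction and decoder-input sampling described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the use of plain Python lists for token ids and hidden states, and 0-indexed positions are all hypothetical choices, and `alpha` stands for the mixing weight.

```python
import random

def mix_hidden(h1, h2, alpha):
    """Weighted sum of two hidden-state vectors (plain lists here for illustration)."""
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(h1, h2)]

def _longer(y1, y2):
    """Return the longer of the two summaries (the first one on ties)."""
    return y1 if len(y1) >= len(y2) else y2

def expected_distribution(y1, y2, t, alpha, vocab_size):
    """Expected decoder distribution at 0-indexed position t.

    y1, y2 are lists of token ids; t must be smaller than the longer length.
    """
    dist = [0.0] * vocab_size
    if t < min(len(y1), len(y2)):
        # Both summaries still have a token at t: split the mass alpha / (1 - alpha).
        dist[y1[t]] += alpha
        dist[y2[t]] += 1.0 - alpha
    else:
        # Only the longer summary remains, so it receives all the mass.
        dist[_longer(y1, y2)[t]] = 1.0
    return dist

def sample_decoder_input(y1, y2, t, alpha, rng=random):
    """Pick the token fed to the next decoding step instead of taking the argmax."""
    if t >= min(len(y1), len(y2)):
        return _longer(y1, y2)[t]
    p = rng.random()  # p ~ U(0, 1)
    return y1[t] if p <= alpha else y2[t]
```

Sampling the fed-back token, rather than taking the argmax of the expected distribution, lets the decoder see tokens from both summaries in proportion to their weights.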
4.3 Curriculum learning (Cur.)
Curriculum learning aims to help the model training process by introducing easier samples first followed by more difficult ones according to a particular difficulty metric. We use the curriculum construction approach introduced in Xu et al. (2020). In this approach, we split data into buckets based on a difficulty metric. We then train the model in a difficulty incremental setting. In this work, we use two different curriculum difficulty metrics.
Specificity (S) measures how specific or vague a piece of text is. We argue that the more specific a piece of text is, the more complicated it can get. For example, texts like "Nothing" or "Everything is Easy" are not specific and are easy for the model to learn, while more specific texts are harder. We feed the model less specific pieces of text first during training, then introduce the more specific ones as training progresses. Specificity is calculated (Appendix) on the reflection/review level, so we use the average value over the whole set of reflections/reviews as the document value. For example, for a training sample whose input document $D$ consists of $n$ independent reflections/reviews $r_1, \dots, r_n$, we calculate the specificity value for the sample as follows:

$$\text{Spec}(D) = \frac{1}{n} \sum_{i=1}^{n} \text{spec}(r_i),$$

where $\text{spec}(r_i)$ is the specificity value of the $i$-th reflection/review.
ROUGE (R) is the standard metric for evaluating summarization performance. Thus, we decided to use ROUGE scores as a difficulty metric. For a training sample, we calculate different ROUGE scores between the input document $D$ and its corresponding human summary $S$, then use the average of (R1, R2, RL) as the difficulty metric. According to Liu et al. (2018), the higher the ROUGE score, the less abstractive the summary is compared to the input, and vice versa. We argue that the more abstractive samples are harder to learn.
Baselines: To our knowledge, no data synthesis technique has been used for summarization in prior work except back generation Parida and Motlicek (2019) and template synthesis Magooda and Litman (2020). Thus, we developed two synthesis baselines (shuffle; shuffle + mask). For both baselines, we generated 10 samples for each original training sample by randomly shuffling the reflections/reviews. Additionally, for the shuffle + mask baseline, we randomly mask 50% of the reflections/reviews 50% of the time.
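The two baselines can be sketched as follows. This is one possible reading, with assumptions labeled: masking is interpreted as replacing a whole reflection/review with a placeholder, the placeholder string `[MASK]` is hypothetical, and "50% of the time" is interpreted as a per-sample coin flip. The function and parameter names are illustrative only.

```python
import random

def shuffle_baseline(reflections, summary, n_copies=10, seed=0):
    """Create n_copies synthetic samples by shuffling the reflection/review order."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_copies):
        shuffled = reflections[:]  # copy so the original order is left untouched
        rng.shuffle(shuffled)
        samples.append((shuffled, summary))
    return samples

def shuffle_mask_baseline(reflections, summary, n_copies=10, mask_ratio=0.5,
                          mask_prob=0.5, mask_token="[MASK]", seed=0):
    """Shuffle, then with probability mask_prob mask mask_ratio of the reflections."""
    rng = random.Random(seed)
    samples = []
    for shuffled, summ in shuffle_baseline(reflections, summary, n_copies, seed):
        if rng.random() < mask_prob:
            k = int(len(shuffled) * mask_ratio)
            # Assumption: a masked reflection is replaced entirely by the placeholder.
            for idx in rng.sample(range(len(shuffled)), k):
                shuffled[idx] = mask_token
        samples.append((shuffled, summ))
    return samples
```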
Paraphrasing with GPT-2: We generate N synthetic samples for each original sample by paraphrasing the human summary and shuffling the input reflections. We varied N (5 or 10) to monitor the effect of synthetic data size.
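The synthesis loop itself is simple once a paraphraser is available. In the sketch below, `paraphrase_fn` is a hypothetical stand-in for the GPT-2 paraphraser (any callable that maps a summary string to a paraphrased string); the function name `synthesize` and the seeding are likewise assumptions made for illustration.

```python
import random

def synthesize(reflections, summary, paraphrase_fn, n=5, seed=0):
    """Build n synthetic (document, summary) pairs from one original sample.

    paraphrase_fn is a placeholder for the paraphrasing model; it is expected
    to return a (possibly different) paraphrase each time it is called.
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        shuffled = reflections[:]  # keep the original sample intact
        rng.shuffle(shuffled)
        samples.append((shuffled, paraphrase_fn(summary)))
    return samples

# Toy usage with a trivial stand-in paraphraser (upper-casing is NOT a real paraphrase).
pairs = synthesize(["r1", "r2", "r3"], "students liked bags", str.upper, n=5)
```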
MixGEN: We integrate MixGEN by combining each sample with other samples during training; we used 3 other samples per sample in our experiments. Moreover, we use a mixing probability of 0.75, as specified by the original code implementation (https://github.com/GT-SALT/MixText).
Curriculum learning: In the curriculum learning experiments, we use a specificity prediction model that consists of a DistilBERT Sanh et al. (2019) encoder with a logistic regression classification layer (Appendix). We normalize the specificity values of the whole training data to lie between 1 and the number of buckets used to split the data; we use 10 buckets. Similarly, we normalize the average ROUGE value to lie in the same range.
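The normalization into buckets and a difficulty-incremental schedule can be sketched as below. Several details are assumptions rather than facts from the paper: min-max scaling followed by rounding to the nearest bucket, and a cumulative schedule in which stage k trains on all buckets up to k. The names `to_buckets` and `curriculum_stages` are hypothetical. The input values are expected to be ordered so that smaller means easier (for average ROUGE, where higher is easier, the values would need to be negated first).

```python
def to_buckets(values, n_buckets=10):
    """Min-max scale difficulty values to [1, n_buckets] and round to a bucket id."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All samples are equally difficult, so they share the first bucket.
        return [1] * len(values)
    return [1 + round((v - lo) * (n_buckets - 1) / (hi - lo)) for v in values]

def curriculum_stages(samples, difficulty, n_buckets=10):
    """Yield (stage, pool) pairs where the pool grows from easy to hard buckets."""
    buckets = to_buckets(difficulty, n_buckets)
    for stage in range(1, n_buckets + 1):
        pool = [s for s, b in zip(samples, buckets) if b <= stage]
        if pool:
            yield stage, pool

# Toy usage: four samples with increasing difficulty values.
for stage, pool in curriculum_stages(["a", "b", "c", "d"], [0.0, 1.0, 2.0, 9.0]):
    pass  # a training loop would run on `pool` at each stage
```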
5.2 Model Training
In all of our experiments, we use the BERTSum model proposed by Liu and Lapata (2019) (https://github.com/nlpyang/PreSumm), which is easy to use and one of the SOTA summarization models. We used the same parameters as in the original code (Appendix). We conducted experiments on the CM and A/Y datasets using the proposed methods in a regular training setting and in a (pretraining → fine-tuning) setting, where we pretrain with synthesized data and fine-tune using the original data.
Tables 2 and 3 show the results of our experiments on the CM and A/Y datasets. Considering data synthesis and augmentation, we first see that the two baselines (shuffle and shuffle+mask) improve performance compared to no data manipulation across all ROUGE scores, except RL for the shuffle baseline on the A/Y dataset. This suggests that reducing the model's dependency on the input sentence order helps the model depend more on the actual input text. Moving to the proposed augmentation technique (MixGEN), we see a performance gain across all ROUGE scores on both datasets when mixing training samples, compared to normal training with a single sample. Similarly, providing synthetic data with the proposed paraphrasing approach outperforms both the original data and the baselines, with (R1, R2, RL) scores of (41.14, 14.24, 26.98) compared to (36.34, 11.39, 26) and (38.57, 11.72, 26.94) for the original data and the shuffle baseline, respectively, on CM, and (28.49, 4.54, 18.08) compared to (27.71, 3.83, 17.83) and (28.34, 4.04, 17.74) on A/Y. Additionally, increasing the synthetic data size improves model performance across all ROUGE scores for both the CM and A/Y datasets (N=5 vs. N=10).
Moving to curriculum learning, we see that integrating a curriculum to reorder the training data using either of the two proposed difficulty metrics leads to consistent improvements across all ROUGE scores for both the CM and A/Y datasets. Additionally, the curriculum improves scores compared to the two augmentation baselines across all ROUGE scores except R1 for the CM data. On the other hand, we don't see consistent ROUGE score improvement when using a curriculum for fine-tuning after pretraining with synthetic data. We hypothesize that this behavior might be due to performing the pretraining phase without curriculum integration, unlike the fine-tuning phase. We plan to conduct experiments with a curriculum integrated into both pretraining and fine-tuning to validate our hypothesis. Furthermore, while both curriculum difficulty metrics (i.e., Specificity and ROUGE) improved over training with no curriculum, we didn't observe any consistent pattern favoring one metric over the other.
7 Conclusion and Future work
In this work, we showed that we could mitigate the effect of data scarcity in different datasets (i.e., CourseMirror and Amazon/Yelp) for abstractive summarization using three simple data manipulation techniques. We showed that synthesizing data with paraphrasing to use for pretraining can boost model performance across all ROUGE scores for different datasets. Additionally, we showed that mixing samples for training can also push the model to be more resilient to overfitting and improve its performance. Finally, we showed that reordering training samples through a curriculum, using the proposed difficulty metrics (i.e., Specificity and ROUGE), helps improve all ROUGE scores across different datasets without the need for any additional data (either true or synthetic). In the future, we plan to try more values for synthesis and MixGen. Additionally, we plan to investigate other curriculum difficulty metrics. We plan to use the BART model as one of the SOTA models for abstractive summarization. Finally, we are doing additional experiments on multitask learning, and we plan to combine both techniques in one framework targeting low resource domains.
The research reported here was supported, in whole or in part, by the Institute of Education Sciences, U.S. Department of Education, through Grant R305A180477 to the University of Pittsburgh. The opinions expressed are those of the authors and do not represent the views of the Institute or the U.S. Department of Education. We would like to thank the Pitt PETAL group and the anonymous reviewers for advice in improving this paper.
References
- Bajaj et al. (2021). Long document summarization in a low resource setting using pretrained language models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Student Research Workshop, Online, pp. 71–80.
- Bražinskas et al. (2020). Few-shot learning for opinion summarization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 4119–4135.
- Chen et al. (2020). MixText: linguistically-informed interpolation of hidden space for semi-supervised text classification. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 2147–2157.
- Fan et al. (2017). Scaling reflection prompts in large classrooms via mobile interfaces and natural language processing. In Proceedings of the 22nd International Conference on Intelligent User Interfaces, pp. 363–374.
- He and McAuley (2016). Ups and downs: modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proceedings of the 25th International Conference on World Wide Web, WWW 2016, Montreal, Canada, April 11 - 15, 2016, J. Bourdeau, J. Hendler, R. Nkambou, I. Horrocks, and B. Y. Zhao (Eds.), pp. 507–517.
- Iyyer et al. (2018). Adversarial example generation with syntactically controlled paraphrase networks. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 1875–1885.
- Kano et al. (2021). Quantifying appropriateness of summarization data for curriculum learning. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Online, pp. 1395–1405.
- Krishna et al. (2020). Reformulating unsupervised style transfer as paraphrase generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 737–762.
- Lewis et al. (2020). BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 7871–7880.
- Liu et al. (2018). Generating wikipedia by summarizing long sequences. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings.
- Liu and Lapata (2019). Text summarization with pretrained encoders. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 3730–3740.
- Luo and Litman (2015). Summarizing student responses to reflection prompts. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, pp. 1955–1960.
- Magooda and Litman (2020). Abstractive summarization for low resource data using domain transfer and data synthesis. In The Thirty-Third International Flairs Conference.
- Parida and Motlicek (2019). Abstract text summarization: a low resource challenge. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 5994–5998.
- Sachan and Xing (2016). Easy questions first? a case study on curriculum learning for question answering. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pp. 453–463.
- Sachan and Xing (2018). Self-training for jointly learning to ask and answer questions. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 629–640.
- Sanh et al. (2019). DistilBERT, a distilled version of bert: smaller, faster, cheaper and lighter. ArXiv preprint abs/1910.01108.
- Sarkhel et al. (2020). Interpretable multi-headed attention for abstractive summarization at controllable lengths. In Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain (Online), pp. 6871–6882.
- Sennrich et al. (2016). Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pp. 86–96.
- Tay et al. (2019). Simple and effective curriculum pointer-generator networks for reading comprehension over long narratives. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 4922–4931.
- Wang et al. (2020). Learning a multi-domain curriculum for neural machine translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 7711–7723.
- Wang and Yang (2015). That’s so annoying!!!: a lexical and frame-semantic embedding based data augmentation approach to automatic categorization of annoying behaviors using #petpeeve tweets. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, pp. 2557–2563.
- Wang et al. (2018). SwitchOut: an efficient data augmentation algorithm for neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 856–861.
- Wang et al. (2015). Building a semantic parser overnight. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Beijing, China, pp. 1332–1342.
- Wieting and Gimpel (2018). ParaNMT-50M: pushing the limits of paraphrastic sentence embeddings with millions of machine translations. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 451–462.
- Xu et al. (2020). Curriculum learning for natural language understanding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 6095–6104.
- Yu et al. (2021). AdaptSum: towards low-resource domain adaptation for abstractive summarization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, pp. 5892–5904.
- Zhang et al. (2015). Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.), pp. 649–657.
Appendix A BERTSum training parameters
We train the model for 200K steps using a batch size of 140. We use 20K steps for BERT warmup, 10K steps for decoder warmup, and a max position of 512. We use 4 Nvidia Quadro RTX 5000 GPUs. For testing, we use the checkpoint with the highest ROUGE score on the validation set.
Appendix B MixGen Model
Figure 1 shows both the original MixText model and the modified MixText for generative tasks (MixGen).
Appendix C Specificity Model
CourseMirror data is also annotated for specificity. The data contains human annotations for around 7000 reflections (https://petal-cs-pitt.github.io/data.html) using the scheme introduced in Fan et al. (2017). Table 4 shows the score distribution for the CourseMirror specificity dataset. We use the data to train a specificity prediction model, which we then use to predict the specificity values for both the CourseMirror and Amazon/Yelp datasets. The specificity prediction model (Figure 2) uses a DistilBERT encoder to produce reflection embeddings, which are then used as features to train a logistic regression classifier. To keep the number of tuned parameters to a minimum, the DistilBERT weights are frozen during training. The embeddings are used as fixed features, and all the training is performed on the logistic classifier side.
Appendix D Data samples
D.1 CourseMirror (CM)
Table 5 shows an example CM sample from a CS course.
|Point of Interest (POI): Describe what you found most interesting in today’s class.|
|Student Reflection Document|
|• the dynamic bag|
|• I found the creation of the Bag to be the most interesting.|
|• Learning about bags was very interesting.|
|• Dr. Ramirez cleared up my understanding of how they should work.|
|• I was really interested in learning all about an entirely new data structure , the Bag.|
|• I ’m also noticing that as these classes get farther along , there is more focus on real world factors that determine strength of code like speed|
|• The bag concept was cool how basically acts like a bag in real life with its usefulness.|
|• Bags as a data type and how flexible they are.|
|• Discussing the Assignment 1|
|• I found the examples and drawings the teacher drew on the whiteboard the most interesting.|
|• Abstraction, though seemingly intimidating is kind of just giving programmers a break right?|
|• We ’re given so many more abilities and operations without having to know exactly how to code that.|
|• That being said , while I understand the applications being explained to me , it ’s hard to just manifest that on my own.|
|• Learning about resizing Bags dynamically|
|• The discussion of the underlying methods of ADTs such as bags was most interesting|
|• the implementation of an array bag|
|• Order does not matter when using a bag.|
|• It is important to keep all of the values in an array together.|
|• To do this , you should move an existing element into the vacant spot.|
|• Looking at ADT ’s from both perspectives|
|• Information held in bags is not in any particular order|
|• different ways to implement the bag|
|• Thinking about a more general idea of coding with ADTs and starting to dig into data structures more specifically.|
|• Code examples of key concepts/methods is always helpful.|
|• I thought it was a good thing to go through the implementation of both the add ( ) and remove ( ) methods of the Bag ADT|
|• Today we were talking about a certain type of ADT called a bag.|
|• We talked about certain ways that we would implement the methods and certain special cases that we as programmers have to be aware of.|
|• If you were removing items from ADT bag , you can simply shift the bottom or last item and put it in the place where you we removed an item.|
|• This is because , in bags , order does not matter.|
|• Learning about managing arrays in a data structure|
|• The bag ADT and how it is implemented|
|Reference Abstractive Summary|
|Students were interested in ADT Bag, and also its array implementation. Many recognized that it should be resizable, and that the underlying array organization should support that. Others saw that order does not matter in bags. Some thought methods that the bag provides were interesting.|
Table 6 shows an example sample from the Amazon/Yelp data.
|This pendant is so unique!! The design is beautiful and the bail is a ring instead of the typical bail which gives it a nice touch!! All the corners are smooth and my daughter loves it - looks great on her.I cannot say anything about the chain because used our own chain.:) Satisfied.|
|It look perfect in a womens neck!! great gift, I thought for the price it was going to look cheap, but I was far wrong. It look great.Spect great reward from your woman when you give this to her; D|
|The prettiest sterling silver piece I own now. I get so many compliments on this necklace. I bought it for myself from my hubby for Valentine’s Day. Why not? When people ask where I got it, I simply say from my loving hubby. And he is off the hook as to what to get me. win + win.|
|I love hearts and I love ’love’:) I do not have any negative feedback, the necklace is perfect and the charm is perfect. I just thought it would have been slightly bigger. Overall, I love my new heart necklace.|
|When I received the package, I was surprised and amazed because the necklace is so elegant, beautiful and the same as the picture shown here. I really love this necklace. It has a unique pendant designed. I will recommend it to someone to order it now…|
|Item is nice. Not a great quality item, but right for the price. Charm was larger than I expected (I expected small and elegant, but it was large and almost costume jewelry like). I think it is a good necklace, just not what I expected.|
|I got this as a present for my GF on Valintines day. She loves it and wears it every day! Its not cheap looking and it hasn’t broken yet. The chain hasn’t broken either even though it is very thin. Strongly recomend it!|
|Over all service has been great the only problem, I ordered a purple Mickey Mouse case for iPhone 4S they sent a black, n I felt it was to much trouble n such a small item to send back so needless to say its put back in a drawer somewhere|
|This silver chain and pendant are elegant and unique. The necklace is very well made, making it a great buy for the cost, and is of high enough quality to be worn every day. The necklace looks beautiful when worn bringing many compliments. Overall, it is highly recommended.|