Researchers are increasingly asked by funders and publishers to outline their research for the public by writing a lay summary. Automatically generating lay summaries would therefore reduce researchers' workload and help build a bridge between science and the public. Previous studies have investigated summarization of scientific articles cohan2018discourse; lev2019talksumm; yasunaga2019scisummnet. However, less work has been done on generating lay summaries.
Recently, the Lay Summary Task (LaySumm 2020; https://ornlcda.github.io/SDProc/index.html) at the First Workshop on Scholarly Document Processing Chandrasekaran2020Overview first proposed the task of lay summary generation. The task aims to generate summaries that are representative of the content, comprehensible, and interesting to a lay audience. After inspecting the dataset provided by the task, we observe that many sentences in the lay summaries have corresponding sentences in the original papers. Inspired by this observation, we hypothesize that creating binary sentence labels for extractive summarization and using them as extra supervision signals can help the model generate better summaries. Therefore, we use the BART lewis2019bart encoder to produce sentence representations and train extractive summarization jointly with abstractive summarization.
Experimental results show that leveraging sentence labels can improve lay summary generation performance. In the LaySumm 2020 competition, our model achieves a Rouge1-F1 score of 46.00%. The code will be released on GitHub (https://github.com/TysonYu/Laysumm).
2 Related Work
Text summarization aims to produce a condensed representation of an input text that captures its core meaning. Recently, neural network-based approaches have reached remarkable performance on news article summarization see2017get; liu2019text; zhang2019pegasus. Compared with news articles, scientific papers are typically longer and contain more complex concepts and technical terms.
Scientific Paper Summarization
Existing approaches for scientific paper summarization include extractive models that perform sentence selection qazvinian2013generating; cohan2017scientific; cohan2018scientific and hybrid models that select the salient text first and then summarize it subramanian2019extractive. Besides, cohan2018discourse built the first model for abstractive summarization of single, longer-form documents (e.g., research papers).
In order to train neural models for this task, several datasets have been introduced. The arXiv and PubMed datasets cohan2018discourse were created using open access articles from the corresponding popular repositories. yasunaga2019scisummnet developed and released the first large-scale manually-annotated corpus for scientific papers (on computational linguistics).
Large Pre-trained Language Model
Large pre-trained language models, such as BERT devlin2018bert, UniLM dong2019unified, and BART lewis2019bart, have shown strong performance on a variety of downstream tasks, including summarization. For example, BART achieved state-of-the-art performance on the CNN/DM hermann2015teaching news summarization dataset.
We use two datasets in this work: the CL-LaySumm 2020 dataset and ScisummNet yasunaga2019scisummnet. In this section, we describe both datasets and the pre-processing method we used.
3.1 CL-LaySumm 2020 Dataset
The CL-LaySumm 2020 Dataset is released by the CL-LaySumm Shared Task, which aims to produce lay summaries of scientific texts. A lay summary is a textual summary intended for a non-technical audience. The dataset contains 572 training samples, each consisting of a full-text paper and its lay summary. At test time, the model must generate a lay summary of at most 150 words for each of 37 papers.
Since the original papers are very long and the task requires us to generate relatively short summaries, it is crucial to extract important parts of papers first before feeding them to large pre-trained models. Given our own experience of how papers are written, we start with the assumption that the Abstract, Introduction and Conclusion are most likely to convey the topic and the contributions of the paper. So, we make different combinations of these three sections as input to our model.
3.2 ScisummNet Dataset
ScisummNet is the first large-scale, human-annotated Scisumm dataset. It provides 1009 papers with their citation networks as well as manual summaries. The gold summaries are written by annotators based on the abstract and on selected citation sentences that also convey the contributions of the papers. We take the abstract and the annotator-selected citation sentences as our models' input.
3.3 Data Pre-processing
As mentioned above, we first represent the document using the sentences in its Abstract, Introduction and Conclusion. Then we use two approaches to pre-process the text.
The first pre-processing approach is removing tags and outliers. The original text of the Laysumm dataset contains many tags such as TITLE, SECTION and PARAGRAPH, and we remove all of them. In addition, some samples of the Laysumm dataset do not contain an Abstract or Introduction. We regard these samples as outliers and exclude them when training the model. The total number of outliers is 23. Finally, we truncate all input text to a maximum length of 1024 tokens, which is the maximum input length of the BART model.
| BART (Data augmentation) | 0.4490 | 0.4887 | 0.1972 | 0.2136 | 0.2895 | 0.3139 |
| BART + Two-stage | 0.4529 | 0.4882 | 0.2067 | 0.2224 | 0.2929 | 0.3140 |
| BART + Multi-label | 0.4600 | 0.5013 | 0.2070 | 0.2223 | 0.2876 | 0.3104 |
We use BART, a denoising autoencoder for pretraining sequence-to-sequence models lewis2019bart, as our baseline.
BART is based on the standard Transformer model vaswani2017attention and can be regarded as generalizing both BERT (through its bidirectional encoder) and GPT (through its left-to-right decoder). It is pre-trained on the same corpus as RoBERTa liu2019roberta with two tasks: text infilling and sentence permutation. For text infilling, 30% of the tokens in each document are masked and the model is trained to recover them at the output. For sentence permutation, the sentences of a document are shuffled before being fed to the model, which is trained to generate them in the correct order.
BART performs strongly on the summarization task. We use the BART model fine-tuned on the CNN/DailyMail dataset hermann2015teaching to initialize our model.
4.2 Multi-Label Summarization Model
There are two canonical strategies for summarization: extractive summarization, which concatenates sentences selected from the source into the summary, and abstractive summarization, which generates novel sentences for the summary. Inspired by the observation that many sentences in human-written lay summaries have corresponding sentences in the original papers, we use an unsupervised approach to convert the abstractive summaries into extractive labels and train abstractive summarization together with extractive summarization.
To create the ground-truth sentence-level binary labels for extractive summarization, which we call ORACLE, we use a greedy algorithm introduced by nallapati2017summarunner. The approach is based on the idea that the sentences selected from the input should be the ones that maximize the Rouge score lin2003automatic with respect to the gold summary.
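The greedy labeling idea can be sketched as follows. This is a minimal illustration, not the authors' implementation: it uses a simplified n-gram overlap F1 as a stand-in for the official Rouge package (no stemming, whitespace tokenization), scores candidates by the sum of the unigram and bigram scores, and the `max_sentences` cap is a hypothetical parameter that does not appear in the paper.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset of n-grams for a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(candidate, reference, n):
    """Simplified ROUGE-N F1 based on clipped n-gram overlap."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def greedy_oracle(sentences, summary, max_sentences=3):
    """Greedily pick sentences whose union best matches the gold summary.

    Returns one 0/1 label per input sentence.
    """
    reference = summary.lower().split()
    tokenized = [s.lower().split() for s in sentences]
    selected, best_score = [], 0.0
    while len(selected) < max_sentences:
        best_idx = None
        for i in range(len(tokenized)):
            if i in selected:
                continue
            # Score the already selected sentences plus candidate i, in document order.
            candidate = [tok for j in sorted(selected + [i]) for tok in tokenized[j]]
            score = rouge_n_f1(candidate, reference, 1) + rouge_n_f1(candidate, reference, 2)
            if score > best_score:
                best_score, best_idx = score, i
        if best_idx is None:  # no remaining sentence improves the score
            break
        selected.append(best_idx)
    return [1 if i in selected else 0 for i in range(len(sentences))]
```

The loop stops as soon as adding any remaining sentence would not raise the score, so the number of positive labels varies from document to document.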
The architecture of our model is shown in Figure 1 and follows the structure of the BART model. The input document is fed into the bidirectional encoder, and the contextual embeddings of the [CLS] symbols are used as the sentence representations. These sentence representations are passed through a feedforward neural network to produce, for each sentence, a binary distribution over whether it belongs to the extractive summary. The abstractive summary is generated by the autoregressive decoder. The overall loss is calculated as $\mathcal{L} = \mathcal{L}_{ext} + \mathcal{L}_{abs}$, where $\mathcal{L}_{ext}$ and $\mathcal{L}_{abs}$ refer to the cross-entropy losses of the extractive and abstractive summaries, respectively.
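To make the joint objective concrete, here is a small framework-free sketch. It assumes that the extractive and abstractive cross-entropy losses are simply summed with equal weight; the function names and the plain-list inputs are hypothetical and stand in for the tensors a real BART training loop would produce.

```python
import math

EPS = 1e-12  # guards against log(0)


def binary_cross_entropy(probs, labels):
    """Mean binary cross-entropy over per-sentence selection probabilities."""
    total = 0.0
    for p, y in zip(probs, labels):
        total -= y * math.log(p + EPS) + (1 - y) * math.log(1 - p + EPS)
    return total / len(labels)


def token_cross_entropy(token_dists, target_ids):
    """Mean negative log-likelihood of the gold summary tokens."""
    nll = -sum(math.log(dist[t] + EPS) for dist, t in zip(token_dists, target_ids))
    return nll / len(target_ids)


def joint_loss(sentence_probs, sentence_labels, token_dists, target_ids):
    """Unweighted sum of the extractive and abstractive losses (an assumption)."""
    loss_ext = binary_cross_entropy(sentence_probs, sentence_labels)
    loss_abs = token_cross_entropy(token_dists, target_ids)
    return loss_ext + loss_abs
```

Here `sentence_probs` would come from the feedforward layer on top of the [CLS] embeddings and `token_dists` from the decoder's softmax, so both terms share the same encoder and their gradients both flow into it.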
4.3 Data Augmentation
Data augmentation is an effective technique for creating new training instances when the training data is insufficient, as demonstrated in computer vision as well as in many NLP tasks chen2017reading; yang2019data; yuan2017machine.
Existing data augmentation approaches in NLP can be categorized into retrieval-based methods chen2017reading; yang2019data and generation-based methods yuan2017machine; buck2017ask. However, none of these suits our situation, since they still require external sources or auxiliary training data. We therefore adopted a method similar to that of nema2017diversity. We used a pre-defined vocabulary of 24,822 words in which each word is associated with a synonym. For each training instance, a fixed ratio of the words in the document (in our case, 1/9), excluding stop words and numerical values, was randomly selected and replaced with their synonyms from the vocabulary. If a selected word was not found in the vocabulary, it was added together with its most similar word according to cosine similarity in the GloVe pennington2014glove vocabulary. For each training instance, this process was repeated 9 times to create 9 new documents, all of which were paired with the summary of the original instance.
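The replacement procedure can be sketched roughly as below. This is an illustrative simplification under several assumptions: the stop-word list and the `synonyms` dictionary are tiny hypothetical stand-ins for the 24,822-word vocabulary, only words already present in the dictionary are eligible, and the GloVe nearest-neighbour fallback for out-of-vocabulary words is omitted.

```python
import random

# Tiny illustrative stop-word list; a real list would be much longer.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "are", "with"}


def is_replaceable(word):
    """Stop words and numerical values are never replaced."""
    return word.lower() not in STOP_WORDS and not any(ch.isdigit() for ch in word)


def augment(document, synonyms, ratio=1 / 9, copies=9, seed=0):
    """Create `copies` variants of `document`, each with about `ratio` of its words swapped."""
    rng = random.Random(seed)
    words = document.split()
    candidates = [
        i for i, w in enumerate(words)
        if is_replaceable(w) and w.lower() in synonyms
    ]
    n_replace = min(len(candidates), max(1, round(len(words) * ratio)))
    variants = []
    for _ in range(copies):
        new_words = list(words)
        # Each copy draws its own random subset of positions to replace.
        for i in rng.sample(candidates, n_replace):
            new_words[i] = synonyms[words[i].lower()]
        variants.append(" ".join(new_words))
    return variants
```

Each variant would then be paired with the unchanged gold summary of the original document, so only the input side of the training pair is perturbed.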
4.4 Two-Stage Fine-tuning
To make use of the ScisummNet dataset, we adopt a two-stage fine-tuning method. In the first stage, we fine-tune the pre-trained BART model on the ScisummNet dataset, using the Abstract and the annotator-selected citation sentences as the input and the gold summary as the output. The model is fine-tuned for 20000 iterations before being saved. In the second stage, we use the same settings as when we fine-tune directly on the CL-LaySumm 2020 dataset.
During the training phase, we randomly select 90% of the CL-LaySumm 2020 Dataset for training and 10% for validation. If a data sample doesn’t contain an Abstract or Introduction, we don’t include it in training or validation. To find the optimal architecture for this task within the models we have, we set up seven different experiments.
BART (Abs): We only use the Abstract as the input to the BART model.
BART (Abs+Intro): We use the Abstract and the first paragraph of the Introduction as the input to the BART model.
BART (Abs+Intro): We use the Abstract and the whole Introduction as the input to the BART model.
BART (Abs+Intro+Con): We use the Abstract, the first paragraph of the Introduction, and the Conclusion (if the paper has one) as the input to the BART model.
BART (Data augmentation): We use the same data as BART (Abs+Intro+Con). For each training sample, we create 9 new input documents by synonym data augmentation.
BART + Two-stage: We use the same data as BART (Abs+Intro+Con) as the input to the BART model. The two-stage fine-tuning method is introduced in Section 4.4.
BART + Multi-label: We use the same data as BART (Abs+Intro+Con). In addition, we add a [CLS] token at the beginning of each sentence in the input.
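The input variants listed above differ only in which sections are concatenated and whether [CLS] markers are inserted. A hedged sketch of that assembly step follows; the `paper` dictionary layout, the function names, the naive regex sentence splitter, and the use of whitespace tokens as a stand-in for BART's subword tokens are all assumptions made for illustration.

```python
import re


def split_sentences(text):
    """Naive sentence splitter on ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def build_input(paper, intro="first_paragraph", use_conclusion=True,
                add_cls=False, max_tokens=1024):
    """Assemble the model input from selected sections of a paper.

    paper: dict with "abstract" (str), "introduction" (list of paragraphs)
           and an optional "conclusion" (str or None).
    intro: None, "first_paragraph" or "whole".
    """
    parts = [paper["abstract"]]
    if intro == "first_paragraph":
        parts.append(paper["introduction"][0])
    elif intro == "whole":
        parts.extend(paper["introduction"])
    if use_conclusion and paper.get("conclusion"):
        parts.append(paper["conclusion"])
    sentences = [s for part in parts for s in split_sentences(part)]
    if add_cls:
        # One [CLS] marker per sentence, used for the sentence-level labels.
        sentences = ["[CLS] " + s for s in sentences]
    tokens = " ".join(sentences).split()
    return " ".join(tokens[:max_tokens])  # truncate to the length limit
```

For example, `build_input(paper, intro=None, use_conclusion=False)` corresponds to the abstract-only setting, while `build_input(paper, add_cls=True)` corresponds to the multi-label setting.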
For the hyperparameters, we use a dynamic learning rate that warms up for 1000 iterations and decays afterward. We set the batch size to 1 because of limited GPU memory. Gradients are accumulated over ten iterations, and we train all models for 6000 iterations on one GPU (GTX 1080 Ti). We save the model with the highest Rouge1-F1 score on the validation set. For the BART model, we use the Hugging Face implementation (https://github.com/huggingface/transformers). We use the BART large model fine-tuned on the CNN/DailyMail dataset.
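The schedule described above can be illustrated with a short sketch. The paper does not state the peak learning rate or the shape of the decay, so the `peak_lr` value and the inverse-square-root decay below are assumptions chosen only for illustration; whether the warm-up counts iterations or optimizer updates is likewise not specified, and the sketch counts iterations.

```python
def learning_rate(step, peak_lr=3e-5, warmup_steps=1000):
    """Linear warm-up to peak_lr, then inverse-square-root decay (assumed shape)."""
    step = max(step, 1)
    if step <= warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (warmup_steps / step) ** 0.5


def optimizer_steps(total_iterations=6000, accumulation=10):
    """Iterations at which the accumulated gradients are applied (batch size 1)."""
    return [it for it in range(1, total_iterations + 1) if it % accumulation == 0]
```

With a batch size of 1 and accumulation over ten iterations, each parameter update effectively sees ten samples, and 6000 iterations correspond to 600 updates.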
6 Result Analysis
Different inputs to the model.
The results of BART (Abs), BART (Abs+Intro), and BART (Abs+Intro+Con) show that adding the Introduction and Conclusion to the input consistently improves the models' performance. However, comparing the two BART (Abs+Intro) variants, using the whole Introduction rather than only its first paragraph decreases the Rouge1 score. We attribute this to the CL-LaySumm 2020 task requiring a relatively short summary of fewer than 150 words: a longer input contains more noise, which makes it harder for the model to summarize. In addition, since the CL-LaySumm 2020 dataset is small, the model does not have enough samples to learn the task.
Two-stage fine-tuning and Data Augmentation.
The experimental results show that two-stage fine-tuning does not improve the model's performance. After checking the details of ScisummNet, we find that the corpus comes from the ACL Anthology Network (AAN) radev2013acl, which means all of its data relates to computational linguistics. In contrast, the CL-LaySumm 2020 dataset uses papers from a variety of domains, including biology and medicine. The statistical differences between these two datasets make it hard for the model to learn prior knowledge that can be transferred to the CL-LaySumm 2020 task.
As for data augmentation, the model performance also does not increase as we expected, which contradicts the results of the original paper nema2017diversity. However, the same method also fails in laskar2020query, which likewise used a large pre-trained model as the starting point for fine-tuning. We therefore suspect that large pre-trained models are less robust to noisy input. Our synonym replacement method is both simple and unsupervised. On the one hand, it can increase the vocabulary diversity of the training data without changing the meaning much; on the other hand, the quality of the generated instances, especially their grammaticality, cannot be guaranteed. Thus, augmenting the data may introduce noise that decreases model performance.
Comparing the BART (Abs+Intro+Con) and BART + Multi-label models, we find that with sentence labels the Rouge1-F1 score is higher while the Recall score is lower, which implies that precision increases substantially. We believe the extra supervision from sentence labels helps the model learn better sentence-level representations. As a result, the model is better able to extract important content from the input, which raises the F1 and Precision scores.
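Because F1 is the harmonic mean of precision and recall, precision can be recovered from a reported F1 and recall pair by inverting $F_1 = 2PR/(P+R)$, which gives $P = F_1 R / (2R - F_1)$. The helper below is a small sketch of that arithmetic; the function name is hypothetical, and applying it to the results table assumes that the second numeric column of each row is the Rouge1 recall, which the extracted table does not state explicitly.

```python
def precision_from_f1_recall(f1, recall):
    """Invert F1 = 2PR / (P + R) to recover precision P."""
    denom = 2 * recall - f1
    if denom <= 0:
        raise ValueError("F1 and recall are inconsistent")
    return f1 * recall / denom


# BART + Multi-label row, assuming the columns are Rouge1-F1 and Rouge1-Recall.
multi_label_precision = precision_from_f1_recall(0.4600, 0.5013)
```

A higher F1 at a lower recall can only occur if precision rises, which is the relationship the analysis above relies on.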
In this paper, we showed how different inputs, data augmentation, training strategies, and sentence labels influence the lay summarization task. We introduced a new method that uses sentence labels as an additional supervision signal when training a BART-based model. Experimental results show that our model generates better summaries as measured by the Rouge1-F1 score.
Appendix A Case Study
A.1 The Lay Summary of this Paper
In the CL-LaySumm 2020 shared task, our model achieves 46.00% Rouge1-F1 score. In this paper, we build a lay summary generation system based on the BART model. We leverage sentence labels as extra supervision signals to improve the performance of lay summarization. Experimental results show that leveraging sentence labels can improve the Lay summary generation performance. The code will be released on Github.
The summary above is generated by our own system with Abstract, Introduction and Conclusion from this paper. Although many sentences are copied from the original text, they are well organized and coherent. Besides, the content of the summary also conveys the topic and the contribution of this paper. In conclusion, our system can produce accurate and readable summaries.