Having informative summaries of scientific articles is crucial for dealing with the avalanche of academic publications in our times. Such summaries would allow researchers to quickly and accurately screen retrieved articles for relevance to their interests. More importantly, such summaries would lead to high quality indexing of the articles by (academic) search engines, leading to more relevant academic search results.
Currently, the role of such summaries is played by the abstracts produced by the authors of the articles. However, authors usually include in the abstract only the contributions and information of the paper that they consider important and ignore others that might be equally important to the scientific community (Elkiss et al., 2008).
A solution to the above problem would be to employ state-of-the-art summarization approaches (Nallapati et al., 2016; See et al., 2017; Chopra et al., 2016; Rush et al., 2015; Paulus et al., 2017), in order to automatically create short informative summaries of the articles to replace and/or accompany author abstracts for machine indexing and human inspection. These approaches, however, have focused on the summarization of newswire articles, while academic articles exhibit several differences and pose major challenges compared to news articles.
First of all, news articles are much shorter than scientific articles and the news headlines that serve as summaries are much shorter than scientific abstracts. Secondly, scientific articles usually include several different key points that are scattered throughout the paper and need to be accurately included in a summary. These problems make it difficult to use summarization models that achieve state-of-the-art performance on newswire datasets for the summarization of academic articles.
We propose SUSIE (StrUctured SummarIzEr), a novel training method that allows us to effectively train existing summarization models on academic articles that have structured abstracts. Our method uses the XML structure of the articles and abstracts in order to split each article into multiple training examples and train summarization models that learn to summarize each section separately. We call such a task structured summarization. We further contribute a novel dataset consisting of open access PubMed Central articles along with their structured abstracts. SUSIE can easily be combined with different summarization models in order to address the problem of long articles and has been found to improve the performance of state-of-the-art summarization models by 4 ROUGE points.
2 Related Work
2.1 Summarization Methods
State-of-the-art summarization methods use recurrent neural networks (RNNs) with the encoder-decoder architecture (or sequence-to-sequence architecture). These methods usually treat the whole source text as an input sequence, encode it into their hidden state and generate a complete summary from that hidden state.
Strong results have been achieved by such models when combined with an attention mechanism (Nallapati et al., 2016; Chopra et al., 2016; Rush et al., 2015). Adding a pointer-generator mechanism has been shown to further improve results (See et al., 2017). The pointer-generator mechanism gives the model the ability to copy important words from the source text in addition to generating words from a predefined vocabulary. Adding a coverage mechanism has been shown to lead to even better results (See et al., 2017). The coverage mechanism prevents the model from repeating itself, which is a common problem with sequence-to-sequence models.
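For reference, the two mechanisms can be written compactly. The notation below follows See et al. (2017) and is not defined elsewhere in this paper: $a^t$ is the attention distribution over source positions at decoder step $t$, $P_{\text{vocab}}$ is the distribution over the fixed vocabulary, $p_{\text{gen}} \in [0, 1]$ is the generation probability, and $c^t$ is the coverage vector.

```latex
% Pointer-generator: mix generating from the vocabulary with copying source word w
P(w) = p_{\text{gen}} \, P_{\text{vocab}}(w) + (1 - p_{\text{gen}}) \sum_{i \,:\, w_i = w} a_i^t

% Coverage: accumulate past attention and penalize attending to the same positions again
c^t = \sum_{t'=0}^{t-1} a^{t'}, \qquad
\text{covloss}_t = \sum_i \min\left(a_i^t, c_i^t\right)
```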
2.2 Summarization Datasets
Most of the summarization datasets that are found in the literature, such as Newsroom (Grusky et al., 2018), Gigaword (Napoles et al., 2012) and CNN / Daily Mail (Hermann et al., 2015), are focused on newswire articles. The average article lengths are relatively small and range from 50 words (Gigaword) to a few hundred words (CNN / Daily Mail, Newsroom). The average summary lengths are also rather small and range from a single sentence (Gigaword, Newsroom) to a few sentences (CNN / Daily Mail).
TAC 2014 (Text Analysis Conference 2014) is the only dataset that focuses on (biomedical) academic articles. The articles have an average of 9,759 words and the summaries an average of 235 words. However, as it consists of just 20 articles, it is not useful for training complex neural network summarization models.
3 Summarizing Academic Papers
3.1 Flat Abstract Summarization
A simple approach to summarizing academic papers would be to train sequence-to-sequence models using the full text of the article as source input and the abstract as reference summary. However, sequence-to-sequence models face multiple difficulties when given long input texts. A very long input sequence requires the encoder RNN to run for many timesteps, which greatly increases the computational cost of the forward pass. To make things worse, training the encoder on very long input sequences becomes increasingly difficult due to the computational cost of the backward pass. Training becomes increasingly slow and in many cases vanishing gradients prevent the model from learning useful information.
A solution to this problem would be to truncate very long sequences (more than 600 words), but this can result in serious information loss which would severely affect the quality of the produced summaries.
Even harder is the training of a decoder with very long output sequences. In this case, the computational complexity and memory requirements of the decoder make it pointless to try and train a model with very long reference summaries.
Another problem with this straightforward approach is that the different sections of an academic paper are not equally important for the task of summarization. Sections like the introduction include core information for the summary, while others like the experiments are noisy and usually include little useful information.
SUSIE (StrUctured SummarIzEr) is a novel summarization method that exploits structured abstracts in order to address the aforementioned problems.
Most academic articles follow a typical structure with sections like introduction, background, methods, results and conclusion. When the abstract of the article is structured, it usually includes similar sections too. Our method looks for specific keywords in the header of each section in order to annotate both the article and abstract sections. For example, sections that include keywords like methods, method, techniques and methodology in their header are annotated as methods. Table 1 presents the different section types and the keywords associated with them.
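Table 1: Section types and the keywords associated with them.

| Section type | Keywords |
|---|---|
| literature | background, literature, related |
| results | result, results, experimental |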
Once the article and abstract sections are annotated, we pair each section of the full text with the corresponding section of the abstract and create one training example per section. We can then use one of the existing summarization methods and train a model for the summarization of single sections. Summarizing a single section of an article is a much easier task since the input and output sequences are a lot shorter and the information is more compact and focused on specific aspects of the article. In addition, section annotation allows us to filter out particular sections that are not useful for summarization.
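The annotation and pairing steps can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the keyword lists for `methods`, `literature` and `results` come from the text and Table 1, and the lists for `introduction` and `conclusion` are assumptions.

```python
# Keyword lists per section type. "methods", "literature" and "results" follow
# the text and Table 1; "introduction" and "conclusion" are assumed here.
SECTION_KEYWORDS = {
    "introduction": ["introduction"],              # assumed
    "literature": ["background", "literature", "related"],
    "methods": ["methods", "method", "techniques", "methodology"],
    "results": ["result", "results", "experimental"],
    "conclusion": ["conclusion", "conclusions"],   # assumed
}


def annotate(header):
    """Return the section type whose keyword occurs in the header, else None."""
    words = header.lower().replace(":", " ").split()
    for section_type, keywords in SECTION_KEYWORDS.items():
        if any(keyword in words for keyword in keywords):
            return section_type
    return None


def make_examples(article_sections, abstract_sections, keep=None):
    """Pair article and abstract sections that share a section type.

    Both arguments are lists of (header, text) tuples. `keep` optionally
    restricts the output to a set of section types (section filtering).
    """
    abstract_by_type = {}
    for header, text in abstract_sections:
        section_type = annotate(header)
        if section_type is not None:
            abstract_by_type.setdefault(section_type, text)

    examples = []
    for header, text in article_sections:
        section_type = annotate(header)
        if section_type in abstract_by_type and (keep is None or section_type in keep):
            examples.append({
                "type": section_type,
                "source": text,
                "summary": abstract_by_type[section_type],
            })
    return examples
```

Each returned dictionary is one training example, so an article with several matched sections yields several short source/summary pairs instead of one long pair.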
At test time, we extract the specified sections of the article and run the summarization model for each of them in order to produce section summaries. Then we combine those summaries in order to get the full summary of the article.
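The test-time procedure amounts to a map over the selected sections followed by a concatenation. The sketch below assumes the sections are already annotated; `summarize_section` is a hypothetical stand-in for any trained summarization model, and the default selection mirrors the sections used in the experiments.

```python
def summarize_article(sections, summarize_section,
                      selected=("introduction", "methods", "conclusion")):
    """Summarize each selected section and join the section summaries.

    sections: list of (section_type, text) tuples, already annotated.
    summarize_section: callable mapping a section text to a summary string
        (a placeholder for a trained model).
    """
    summaries = [
        summarize_section(text)
        for section_type, text in sections
        if section_type in selected
    ]
    return " ".join(summaries)
```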
4 PMC Structured Abstracts
PubMed Central (PMC) is a free digital repository that archives publicly accessible full-text scholarly articles that have been published within the biomedical and life sciences journal literature. The PMC-SA (PMC Structured Abstracts) dataset was created from the open access subset of PMC, comprising approximately 2 million articles. We used the XML format downloaded from the PMC FTP server to create the dataset. Only the articles that have abstracts structured in sections were selected and included in the dataset. PMC-SA has a total of 712,911 full text articles along with their abstracts. The full texts of the articles have an average length of 2,514 words and are used as source texts for the summarization, while the abstracts have an average length of 260 words and are used as reference summaries.
We can easily apply SUSIE on PMC-SA since the XML format allows us to effectively split the full text and abstract into annotated sections. We create approximately 4 examples per article. The average length of each article section is 677 words and the average length of each abstract section is 130 words. It should be noted here that sections like background and methods are usually a lot longer than the introduction and conclusions and as a result there is a high variance in terms of the length of different sections.
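As an illustration of how the XML structure supports this split, the sketch below parses a small hand-written sample with Python's standard `xml.etree.ElementTree`. The sample is a simplified assumption about the layout (an `<abstract>` and a `<body>` that each contain `<sec>` elements with a `<title>`); real PMC files carry more metadata and nesting, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified, hand-written sample; not an actual PMC record.
SAMPLE = """<article>
  <front><article-meta><abstract>
    <sec><title>Background</title><p>Short background.</p></sec>
    <sec><title>Methods</title><p>Short methods.</p></sec>
  </abstract></article-meta></front>
  <body>
    <sec><title>Background</title><p>Long background.</p><p>More text.</p></sec>
    <sec><title>Methods</title><p>Long methods.</p></sec>
  </body>
</article>"""


def sections(parent):
    """Return (title, text) for each direct <sec> child of `parent`."""
    out = []
    for sec in parent.findall("sec"):
        title = (sec.findtext("title") or "").strip()
        # Join the text of every paragraph found anywhere inside the section.
        text = " ".join("".join(p.itertext()).strip() for p in sec.iter("p"))
        out.append((title, text))
    return out


def split_article(xml_string):
    """Return (abstract_sections, body_sections), or None if the abstract
    is missing or not structured into sections."""
    root = ET.fromstring(xml_string)
    abstract = root.find(".//abstract")
    body = root.find("body")
    if abstract is None or body is None or abstract.find("sec") is None:
        return None
    return sections(abstract), sections(body)
```

Returning `None` for unstructured abstracts corresponds to the selection rule above: only articles whose abstracts are split into sections are kept.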
When compared with the existing datasets discussed in section 2.2, PMC-SA is clearly different in multiple ways. The articles and summaries are significantly longer than those of the newswire datasets, which makes summarization a much harder task. Also, the new dataset is a lot larger than the TAC 2014 dataset, which is the only other dataset that consists of academic publications. This makes it suitable for the training of state-of-the-art summarization models. Finally, the XML format and structured abstracts allow us to easily apply SUSIE on this dataset.
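Table 2: ROUGE F1 scores on PMC-SA for the flat abstract method and SUSIE.

| Model | ROUGE-1 F1 (flat) | ROUGE-1 F1 (SUSIE) | ROUGE-2 F1 (flat) | ROUGE-2 F1 (SUSIE) | ROUGE-L F1 (flat) | ROUGE-L F1 (SUSIE) |
|---|---|---|---|---|---|---|
| pointer-generator + coverage | 0.3300 | 0.3716 | 0.1142 | 0.1466 | 0.2893 | 0.3296 |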
As we mentioned, SUSIE can be combined with a number of different summarization models. In order to evaluate the effectiveness of SUSIE, the three different summarization models that were described in section 2.1 are trained and evaluated on PMC-SA using both the flat abstract method from section 3.1 and SUSIE.
The training set has 641,994 articles, the validation set has 35,309 articles and the test set has 10,111 articles. In all experiments we included for summarization only the introduction, methods and conclusion sections, because we have found that this particular section selection gives us the best performing models. For the flat abstract method, the selected sections are concatenated and used as source input, paired with the concatenation of the corresponding abstract sections as reference summary. For SUSIE, one example is created for each of the selected sections with the corresponding abstract section as reference summary. In Table 3 we provide detailed statistics about the training data used in the two different methods.
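Table 3: Statistics of the training data used in the two methods.

| | Flat abstract | SUSIE |
|---|---|---|
| # training articles | 641,994 | 641,994 |
| # training examples | 641,994 | 1,211,826 |
| avg. source length (words) | 1,451 | 677 |
| avg. summary length (words) | 260 | 130 |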
5.1 Experimental Setup
The hyperparameter setup used for the models is similar to that of See et al. (2017) and is detailed in the supplementary material.
In order to speed up the training process, we start off with highly truncated input and output sequences. In more detail, we begin with input and output sequences truncated to 50 and 10 words respectively and train until convergence. Then we gradually increase the input and output sequences up to 500 and 100 words respectively.
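The staged training can be sketched as a loop over increasing truncation limits. Only the first stage (50 and 10 words) and the last stage (500 and 100 words) are taken from the text; the intermediate stages, the function names and the `train_until_convergence` callable are assumptions made for illustration.

```python
# (max source words, max summary words) per stage.
# First and last stages are from the text; the ones in between are assumed.
STAGES = [(50, 10), (100, 20), (200, 40), (400, 80), (500, 100)]


def truncate_pair(source, summary, max_source, max_summary):
    """Keep only the first max_source / max_summary words of each text."""
    return (
        " ".join(source.split()[:max_source]),
        " ".join(summary.split()[:max_summary]),
    )


def curriculum(pairs, train_until_convergence, stages=STAGES):
    """Train on progressively longer truncations of the same examples.

    pairs: list of (source, summary) strings.
    train_until_convergence: placeholder callable that trains the model
        on a list of truncated pairs.
    """
    for max_source, max_summary in stages:
        truncated = [truncate_pair(s, t, max_source, max_summary) for s, t in pairs]
        train_until_convergence(truncated)
```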
When using the flat abstract method, we truncate each section to $L/n$ words before concatenating them to get the input and output sequences, where $L$ is the required article length and $n$ is the number of extracted sections from this article.
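Reading the truncation rule as an equal split of the word budget across the extracted sections, the construction of a flat input can be sketched as below. The use of integer division and the function name are assumptions.

```python
def flat_example(section_texts, total_length=500):
    """Cut each section to an equal share of the word budget, then concatenate.

    section_texts: list of section strings, in article order.
    total_length: required length of the concatenated sequence in words.
    """
    n = len(section_texts)
    if n == 0:
        return ""
    per_section = total_length // n  # equal share per section (assumed rounding)
    truncated = (" ".join(text.split()[:per_section]) for text in section_texts)
    return " ".join(truncated)
```

Note that under this reading a section shorter than its share leaves part of the budget unused, so the concatenated sequence can be shorter than `total_length`.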
The truncation of an academic article to a total of 500 words is definitely going to result in some severe information loss, but we deemed it necessary due to the difficulties described in section 3.1. To get the coverage model, we simply add the coverage mechanism to the converged pointer-generator model and continue training.
At test time, for the flat abstract method, we truncate each input section to $L/n$ words with $L = 500$, where $n$ is the number of selected sections, and concatenate them to get an input sequence of 500 words. Then we run beam search for 120 decoding steps in order to generate a summary. For SUSIE, each of the selected sections is truncated to 500 words before we run beam search for 120 decoding steps to get a summary for each one of them. Then we concatenate the individual summaries to get the summary of the full article.
We evaluate the performance of all models with the ROUGE family of metrics (Lin, 2004) using the pyrouge package (pypi.org/project/pyrouge/0.1.3). Specifically, we report F1 scores for ROUGE-1, ROUGE-2 and ROUGE-L. ROUGE-1 and ROUGE-2 measure the overlap, in unigrams and bigrams respectively, between the generated and the reference summary. ROUGE-L measures the longest common subsequence overlap.
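To make the three metrics concrete, the following is a simplified pure-Python sketch of n-gram and longest-common-subsequence F1. It is not the pyrouge implementation: it tokenizes on whitespace, applies no stemming, and treats each summary as a single sequence, so its numbers will generally differ from the official scores.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(overlap, n_candidate, n_reference):
    """Harmonic mean of precision and recall; 0.0 when there is no overlap."""
    if overlap == 0:
        return 0.0
    precision = overlap / n_candidate
    recall = overlap / n_reference
    return 2 * precision * recall / (precision + recall)


def rouge_n_f1(candidate, reference, n):
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    return f1(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    c, r = candidate.split(), reference.split()
    return f1(lcs_length(c, r), len(c), len(r))
```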
Table 2 presents the results of our experiments. We can see that the pointer-generator model achieves higher scores than the simple attention sequence-to-sequence model, and that adding the coverage mechanism further improves those scores, which is in line with the experiments of See et al. (2017).
We also notice that SUSIE improves the scores of the flat summarization approach for all three models by as much as 4 ROUGE points. The performance of the best model, pointer-generator with coverage, is improved by approximately 13%, 28% and 14% in terms of ROUGE-1, ROUGE-2 and ROUGE-L F1 score respectively. It is clear that the flat approach suffers from information loss due to the truncation of the source input. In the supplementary material we provide examples to illustrate the difference in the quality of summaries.
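The reported gains for the best model can be checked directly from the pointer-generator + coverage row of Table 2. The short script below is only an arithmetic check; the dictionary and function names are illustrative.

```python
# (flat, SUSIE) F1 scores for pointer-generator + coverage, copied from Table 2.
SCORES = {
    "ROUGE-1": (0.3300, 0.3716),
    "ROUGE-2": (0.1142, 0.1466),
    "ROUGE-L": (0.2893, 0.3296),
}


def improvements(scores):
    """Return {metric: (absolute gain in ROUGE points, relative gain in %)}."""
    out = {}
    for metric, (flat, susie) in scores.items():
        absolute_points = (susie - flat) * 100
        relative_percent = (susie - flat) / flat * 100
        out[metric] = (absolute_points, relative_percent)
    return out


for metric, (points, percent) in improvements(SCORES).items():
    print(f"{metric}: +{points:.2f} points, +{percent:.1f}% relative")
```

Rounded to whole numbers, the relative gains are 13%, 28% and 14%, and the absolute ROUGE-1 gain is about 4 points, matching the figures quoted above.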
This work focused on the summarization of academic publications. We have shown that summarization models that perform well on shorter articles have difficulties when applied to longer articles with a lot of diverse information, like academic articles. We proposed SUSIE, a novel approach that allowed us to successfully adapt existing summarization models to the task of structured summarization of academic articles. We also created PMC-SA, a new dataset of academic articles that is suitable for the training of summarization models using SUSIE. We found that training with SUSIE on PMC-SA greatly improves the performance of summarization models and the quality of the generated summaries.
- Chopra et al. (2016) Sumit Chopra, Michael Auli, and Alexander M Rush. 2016. Abstractive sentence summarization with attentive recurrent neural networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 93–98.
- Elkiss et al. (2008) Aaron Elkiss, Siwei Shen, Anthony Fader, Güneş Erkan, David States, and Dragomir Radev. 2008. Blind men and elephants: What do citation summaries tell us about a research article? Journal of the American Society for Information Science and Technology.
- Grusky et al. (2018) Max Grusky, Mor Naaman, and Yoav Artzi. 2018. Newsroom: A dataset of 1.3 million summaries with diverse extractive strategies. arXiv preprint arXiv:1804.11283.
- Hermann et al. (2015) Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, pages 1693–1701.
- Lin (2004) Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. Text Summarization Branches Out.
- Nallapati et al. (2016) Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Caglar Gülçehre, and Bing Xiang. 2016. Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pages 280–290, Stroudsburg, PA, USA. Association for Computational Linguistics.
- Napoles et al. (2012) Courtney Napoles, Matthew Gormley, and Benjamin Van Durme. 2012. Annotated gigaword. In Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction, pages 95–100.
- Paulus et al. (2017) Romain Paulus, Caiming Xiong, and Richard Socher. 2017. A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304.
- Rush et al. (2015) Alexander M Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. arXiv preprint arXiv:1509.00685.
- See et al. (2017) Abigail See, Peter J Liu, and Christopher D Manning. 2017. Get To The Point: Summarization with Pointer-Generator Networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083, Stroudsburg, PA, USA. Association for Computational Linguistics.