GASP! Generating Abstracts of Scientific Papers from Abstracts of Cited Papers

by   Fabio Massimo Zanzotto, et al.

Creativity is one of the driving forces of humankind, as it allows us to break with current understanding and envision new ideas, which may revolutionize entire fields of knowledge. Scientific research offers a challenging environment in which to learn a model of the creative process. In fact, scientific research is a creative act in the formal setting of the scientific method, and this creative act is described in articles. In this paper, we dare to introduce the novel, scientifically and philosophically challenging task of Generating Abstracts of Scientific Papers from abstracts of cited papers (GASP) as a text-to-text task to investigate scientific creativity. To foster research in this novel, challenging task, we prepared a dataset by using services that solve the problem of copyright; hence, the dataset is publicly available with its standard split. Finally, we experimented with two vanilla summarization systems to start the analysis of the complexity of the GASP task.




1 Introduction

Creativity is one of the driving forces of humankind. Learning helps us catch up with current knowledge and the current understanding of the world. Creativity, instead, allows us to break with current understanding and envision new ideas, which may revolutionize entire fields of knowledge.

Scientific research offers a challenging environment in which to explore and, possibly, learn the regularities underlying the creative process: here the creative process is formal and documented in texts. In fact, scientific research is a creative act in the formal setting of the scientific method. “Standing on the shoulders of giants”, scientists have new ideas, which aim to go beyond the current understanding of the world. These ideas stem from existing knowledge and from the creative thinking of a group of scientists. Moreover, scientists are forced to document their creative process by publishing papers. In these papers, the creative process is, to some extent, described. The background knowledge is declared, that is, references are provided, and findings are described. Creativity in scientific research is thus tracked and documented.

In scientific research, the creative process can be seen as a text-to-text process. In fact, papers containing novel ideas are texts and are “produced from” referred papers, which are texts. This is a tremendous opportunity to study how to replicate a well-defined creative process.

Deep neural networks are giving the impression that it is possible to attack and resolve sequence-to-sequence tasks and, hence, text-to-text tasks. Sequence-to-sequence (Seq2Seq) neural networks are used to learn conversational agents from dialog Ghazvininejad et al. (2017). In this case, Seq2Seq NNs have to learn the relation between a stimulus utterance and a response utterance. There is no direct relation between the two utterances, yet these systems achieve positive results. Moreover, Seq2Seq NNs are used for abstractive summarization Rush et al. (2015); Chopra et al. (2016); Nallapati et al. (2016); See et al. (2017). Abstractive summarization is a text-to-text task in which the summary may use words different from those in the documents being summarized. Finally, encoder-decoder architectures are used to generate textual captions of images and, also, medical reports from medical images. Text-to-text tasks seem to be within the reach of today’s neural networks.

In this paper, we dare to introduce a novel, challenging task of Generating Abstracts of Scientific Papers from abstracts of cited papers (GASP); we propose the GASP corpus and we experiment with it using two vanilla systems. The GASP task is a reduced version of the scientific creative process documented in papers. We define it as a text-to-text task as follows: given the abstracts of cited papers, produce the abstract of the current paper. To foster research in this novel, challenging task, we prepared a dataset by using services that solve the problem of copyright; hence, the dataset is publicly available with its standard split. Finally, we experimented with two vanilla summarization systems to start the analysis of the complexity of the GASP task.

The major contributions of this paper are:

  • the GASP task - a novel, challenging task capturing scientific creativity;

  • the GASP corpus - a corpus on which to test text-to-text systems for this novel, challenging task;

  • the initial analysis of two vanilla summarization systems on the GASP task.

The rest of the paper is organized as follows. Section 2 reports on the background and related work. Section 3 formally defines the GASP task and describes the collected corpus. Section 4 briefly describes the two vanilla systems used in the experiments. Section 5 reports on the experiments with the vanilla systems on the GASP corpus. Finally, Section 6 draws some conclusions and envisages future activities.

2 Background and related work

Automatic generation of scientific papers has been attacked in the past and it has been seen as a way to test the scientific validity of specific conferences. Systems like SCIgen generate random papers by using a probabilistic context-free grammar in a generative way. Clearly, this is a very different case with respect to the GASP task.

Generating Abstracts of Scientific Papers from abstracts of cited papers (GASP) seems to be strongly related to abstractive multi-document text summarization Nallapati et al. (2016); Gulcehre et al. (2016); Paulus et al. (2018). Indeed, abstractive text summarization has already been used to generate the related work section of scientific papers Hu and Wan (2014); Chen and Zhuge (2017) or to automatically generate survey papers Jiang et al. (2019).

GASP has similarities but also important differences with respect to abstractive summarization. In abstractive summarization, target summaries and, hence, generated summaries contain phrases that are new with respect to the source documents: this is similar to our GASP task, where target abstracts may generally contain novel sentences, which are not in the abstracts of cited papers. However, GASP is not only a summarization task. Scientific papers and, consequently, abstracts of scientific papers must contain a degree of novelty with respect to cited papers. Hence, GASP is definitely a novel, intriguing task, which aims to investigate scientific creativity.

Although different, abstractive multi-document text summarization can help to envisage how to use neural network models and how to evaluate different systems. Neural networks have gained a lot of attention in automatic abstractive summarization Dong (2018) since the task is seen as a sequence-to-sequence NLP task. The baseline system for encoding source texts seems to be a bag-of-words encoder Rush et al. (2015), while systems for automatic abstractive summarization use convolutional neural networks (CNN) Rush et al. (2015); Chopra et al. (2016) and recurrent neural networks (RNN) Nallapati et al. (2016); See et al. (2017).

Evaluating systems that produce text is a very difficult and debated problem, both in Machine Translation (MT) and Automatic Summarization (AS). The risk is to rely on measures that penalize good system behavior. In MT or AS, what matters is that the system output semantically covers what is in the reference: if the words are not exactly the same, it is not a major issue. Hence, the best way to evaluate systems is human assessment Coughlin (2003). However, this evaluation methodology is extremely expensive.

Many reference-based metrics have been proposed and widely used for evaluating MT and AS. Among these, the most common are BLEU Papineni et al. (2002), ROUGE Lin (2004) and METEOR Banerjee and Lavie (2005). These measures differ because they are based on different principles. BLEU Papineni et al. (2002) is precision-based: it counts how many n-grams of the generated text are in the ground-truth reference(s), with a penalization factor for repeated n-grams. Generated texts repeating n-grams receive a lower score. However, BLEU gives better scores to short generated texts since, as for any precision-based measure, the less systems say, the better they are evaluated. On the other side, there is ROUGE Lin (2004), which is recall-based. In this case, the metric counts the number of n-grams in the references that are covered by the generated text. There is also a variant that uses the length of the longest common subsequence between references and generated texts. Clearly, like any recall-based metric, ROUGE favors long generated texts even if these texts may contain irrelevant information. To mitigate the problems of fully precision-oriented and fully recall-oriented measures, METEOR Banerjee and Lavie (2005) has been proposed as the harmonic mean of precision and recall over unigrams. In METEOR, references and generated texts are aligned before recall and precision are computed. This measure also uses a penalizing factor, which detects fragmentation by counting unigrams that are close in the references but far apart in the generated texts.
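The precision/recall trade-off described above can be illustrated with a minimal sketch of clipped n-gram overlap. It is not the official BLEU, ROUGE or METEOR implementation: it omits BLEU's brevity penalty and geometric averaging as well as METEOR's alignment, and the function names and toy sentences are our own assumptions.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_overlap(generated, reference, n=1):
    """Clipped n-gram precision, recall and F1 between two token lists."""
    gen, ref = ngrams(generated, n), ngrams(reference, n)
    # Counter intersection keeps the minimum count per n-gram, so a repeated
    # n-gram is credited at most as often as it occurs in the reference.
    matches = sum((gen & ref).values())
    precision = matches / max(sum(gen.values()), 1)
    recall = matches / max(sum(ref.values()), 1)
    f1 = 0.0 if matches == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy texts (made up for illustration).
reference = "we propose a new task for scientific creativity".split()
short_output = "we propose".split()
long_output = ("we propose a new task for scientific creativity "
               "and many other unrelated things").split()

print(ngram_overlap(short_output, reference))  # high precision, low recall
print(ngram_overlap(long_output, reference))   # full recall, lower precision
```

The short output says little and is rewarded on precision, while the long output covers the whole reference and is rewarded on recall despite its irrelevant tail.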

The GASP task we propose here shares the evaluation issues of Machine Translation and Automatic Summarization. Hence, we should be careful to choose appropriate metrics, in order not to penalize good solutions to the GASP task.

Figure 1: The GASP task: corpus facts and statistics. Panels: (a) source abstracts per target paper; (b) words in papers; (c) starting contribution signal.

3 Capturing Scientific Creativity in a Formal Task and in a Shared Dataset

The following sections provide a thorough definition of the GASP task, the procedure and the data sources for building the GASP corpus and the corpus itself.

3.1 GASP! The Task

The GASP task aims to capture, at least in part, the creativity used to produce novel ideas. For this purpose, GASP refers to scientific papers, which describe the creation of novel ideas by using existing ideas, that is, referenced papers. For the sake of simplicity, the definition of the GASP task is built upon abstracts of papers and not on full papers.

The GASP task aims to produce the abstract of a paper - the target abstract - given the abstracts of the set of referred papers - the source abstracts - which we may assume have been inspirational for the idea in the target paper. Hence, the formal definition is the following:

$$t = \mathrm{GASP}(S), \qquad S = \{s_1, \dots, s_n\}$$

where $t$ is the target abstract and $S$ is the set of source abstracts of the inspirational papers. The task is then to produce $t$ by reading all $s_i \in S$. Both $t$ and $s_i$ are to be intended as sequences of words. The following is an example of $(t, S)$, which shows what the GASP task aims to be:

$t$: United States prisons release more than 600,000 individuals each year. Within three years of release, 50 percent of released prisoners are back in prison. […] I find that inmates who participate in work release have better post-prison employment outcomes.
$s_1$: Employment programs for disadvantaged male youth have been suggested as a possible new weapon in America’s War on Drugs. In this paper panel data at the neighborhood level are used to investigate […]
$s_2$: This paper empirically assesses the wage effects of the Job Corps program, one of the largest federally-funded job training programs in the United States. […]
$s_3$: We estimate the post-release economic effects of participation in prison-based General Educational Development (GED) programs using a panel of earnings records […]

By using the sequences of words in $s_1$, $s_2$ and $s_3$, systems are expected to produce $t$ = United States prisons release more […]. As shown in the example, abstracts have an important feature. Abstracts are divided into two parts: a sort of introduction and the discussion of the novel idea. These two parts are separated by signal words, for instance “I find” in the target abstract and “In this paper” in the first source abstract.

3.2 Corpus Statistics and Facts

Hence, the GASP task is a text-to-text task, which is similar to many sequence-to-sequence tasks. However, the GASP task clearly differs from other text-to-text tasks because: (1) required outputs, that is, target abstracts, and inputs, that is, source abstracts, are generally longer than in other tasks; (2) a large part of each target abstract is novel information, generally introduced by signal words. Because of these features, GASP is extremely difficult.

3.3 Data Sources and Extraction Procedures

There are millions of records for paper metadata, which are publicly available. Hence, we have a huge opportunity to build the GASP corpus.

Among all the data sources, we collected the GASP Corpus by using Semantic Scholar. There are at least two good reasons for using Semantic Scholar for the GASP Corpus: (1) Semantic Scholar already shares paper metadata publicly, since the Semantic Scholar Open Research Corpus Ammar et al. (2018) is already available; (2) Semantic Scholar offers APIs to reconstruct the corpus, if needed, or to fill in missing information by querying the live servers directly. Hence, Semantic Scholar gives the opportunity to build an open corpus and gives a guarantee that GASP can be shared. The corpus is available for download.

The GASP Corpus has then been built by selecting papers from the Semantic Scholar Open Research Corpus (SSORC) Ammar et al. (2018). In SSORC, papers are represented with many metadata fields; for GASP, we are interested in the following: abstract and outCitations, the latter being the referred papers captured in the corpus. We selected 120,000 papers whose outCitations contain at least one paper. We aimed to build a corpus with training, testing and validation sets of, respectively, 100,000, 10,000, and 10,000 instances. Hence, for each of the 120,000 papers, we built an instance as follows: the target abstract $t$ is taken directly from the metadata of the paper; the source abstracts $s_1$, …, $s_n$ are extracted by reading the list in outCitations and recovering abstracts from inside SSORC or, if at least a fixed fraction of the outCitations list has already been covered, by using the web APIs of Semantic Scholar.
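The extraction procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the script used to build the corpus: records are assumed to be held in memory as a dictionary keyed by a paper identifier, only the fields abstract and outCitations come from the description above, the helper names are hypothetical, and the fallback to the Semantic Scholar web APIs is left out.

```python
import random

def build_instances(records, min_sources=1):
    """Pair each paper's abstract with the abstracts of the papers it cites.

    `records` maps a paper id (an assumed key) to a dict with the fields
    `abstract` and `outCitations` (a list of cited paper ids).
    """
    instances = []
    for paper_id, paper in records.items():
        target = (paper.get("abstract") or "").strip()
        sources = []
        for cited_id in paper.get("outCitations", []):
            cited = records.get(cited_id)
            # Citations outside the collection, or without an abstract,
            # cannot be recovered locally and are skipped.
            if cited and (cited.get("abstract") or "").strip():
                sources.append(cited["abstract"].strip())
        if target and len(sources) >= min_sources:
            instances.append({"id": paper_id, "target": target, "sources": sources})
    return instances

def split_instances(instances, n_test, n_valid, seed=0):
    """Shuffle and split into training, testing and validation lists."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    test = shuffled[:n_test]
    valid = shuffled[n_test:n_test + n_valid]
    train = shuffled[n_test + n_valid:]
    return train, test, valid

# Toy records standing in for SSORC entries (contents are made up).
records = {
    "p1": {"abstract": "We study A using B.", "outCitations": ["p2", "p3", "p9"]},
    "p2": {"abstract": "B is introduced.", "outCitations": []},
    "p3": {"abstract": "", "outCitations": []},
}
instances = build_instances(records)
print(instances)
```

In the toy data, only p1 yields an instance: p3 has no abstract and p9 is not in the collection, which mirrors the coverage gaps that reduce the number of source abstracts per target paper.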

In order to have the possibility to explain results of text-to-text neural networks on the GASP task, we here examine some facts of the GASP Corpus. In fact, this corpus has some particularities which will challenge these systems based on NNs.

First, most target abstracts are shorter than 400 words, with peaks at 100, 150, 200, 250, 300 and 400 words that are explained by length limits in paper formats (see Fig. 1(b)). There is a very long tail, with abstracts reaching a length of 1,400 words. Output and input texts of this length are very challenging for end-to-end systems.

Second, the number of source abstracts per target paper is extremely small. In fact, most of the target papers have only up to four source abstracts (see Fig. 1(a)). This depends on three different sources of problems: 1) target papers really have only up to four cited papers; 2) the procedure used in SSORC to analyze citations failed to cover some citation references; 3) our extraction procedure failed to recover the abstracts of some referred papers even though metadata exist for these cited papers. Having few input abstracts can be a potential problem. In fact, the task could be really hard to solve and a system could be unable to generalize correctly.

Third, target papers rely on different sets of source papers. In fact, given the subset of target papers with only one source paper, the ratio of unique source papers with respect to all source papers is 0.89. This ratio is around 1 for all the target papers that have more than one source paper. This means that nearly every set of input papers is unique. Hence, a given set of source papers almost always corresponds to a single target paper. The GASP dataset is then coherent, as only one idea is derived from a set of inspiring ideas.
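The uniqueness ratio above can be computed with a few lines. The sketch assumes that each target paper is represented by the list of identifiers of its source papers; the function name and the toy identifiers are hypothetical.

```python
from collections import defaultdict

def unique_ratio_by_size(source_id_lists):
    """For each number of sources, the share of distinct source sets."""
    groups = defaultdict(list)
    for ids in source_id_lists:
        # A frozenset ignores the order in which sources are listed.
        groups[len(ids)].append(frozenset(ids))
    return {size: len(set(sets)) / len(sets) for size, sets in groups.items()}

# Toy data: four targets with one source each, two targets with two sources.
toy = [["s1"], ["s2"], ["s1"], ["s2"], ["s3", "s4"], ["s4", "s5"]]
ratios = unique_ratio_by_size(toy)
print(ratios)
```

A ratio of 1 for a group means that no two target papers in that group share exactly the same set of source papers.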

Finally, one other important fact is the distribution of pre-existing knowledge with respect to novel ideas in target and source abstracts. This is extremely important as pre-existing knowledge should be easier to reproduce than novel knowledge. Abstracts are generally organized in two parts. A first part describes the problem and current solutions. A second part describes the novel idea proposed in the paper. There is a clear separation between the two parts as the second part is introduced by a signal phrase. We analyzed the GASP Corpus by using the following signal phrases: 'I', 'We', 'in this paper', 'in this work', 'our approach', 'our work', 'this paper', 'this work', 'this study'. Abstracts in the GASP Corpus are unbalanced since most of each abstract reports on the novelty introduced by the paper. In fact, a large number of abstracts have the signal phrase within the first 30% of their length (see Fig. 1(c)). Hence, the GASP task is extremely challenging as most of the content in abstracts is novel and cannot result from a simple summarization of source abstracts.
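The position of the contribution signal can be estimated as in the sketch below. The list of signal phrases is the one given above; the matching policy (whole-word, case-sensitive matching for 'I' and 'We', case-insensitive matching for the others) and the use of character offsets are our assumptions for illustration.

```python
import re

SIGNAL_PHRASES = ["I", "We", "in this paper", "in this work", "our approach",
                  "our work", "this paper", "this work", "this study"]

def _pattern(phrase):
    # 'I' and 'We' are matched case-sensitively as whole words to avoid hits
    # inside other words; longer phrases ignore case. Assumed policy.
    flags = 0 if phrase in ("I", "We") else re.IGNORECASE
    return re.compile(r"\b" + re.escape(phrase) + r"\b", flags)

PATTERNS = [_pattern(p) for p in SIGNAL_PHRASES]

def signal_position(abstract):
    """Relative position (0.0 to 1.0) of the earliest signal phrase, or None."""
    starts = [m.start() for m in (p.search(abstract) for p in PATTERNS) if m]
    if not starts or not abstract:
        return None
    return min(starts) / len(abstract)

example = ("United States prisons release more than 600,000 individuals each "
           "year. I find that inmates who participate in work release have "
           "better post-prison employment outcomes.")
print(signal_position(example))
```

A value below 0.3 corresponds to an abstract in which the signal phrase occurs within the first 30% of the text.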

4 Vanilla Systems

We treated the GASP task as an abstractive summarization task. In a text summarization task, the data consist of pairs $(x, y)$, where the input $x = (x_1, \dots, x_n)$ is a series of tokens and the output $y = (y_1, \dots, y_m)$ is a series of tokens, with $m < n$ in general. However, in the GASP Corpus the input is a variable-length series of papers $s_1, \dots, s_k$, each of which is a series of tokens. In our systems we decided to concatenate the token sequences of the papers to reduce the problem to a general text summarization problem. Each token was given to the systems as is, without any particular preprocessing.
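The reduction to a single input sequence can be sketched as follows. Whitespace tokenization, the function name and the default input limit of 400 tokens are assumptions made for illustration; the separator is a parameter, and the 50-word target limit is the one used in the experiments.

```python
def make_pair(source_abstracts, target_abstract, sep="[SEP]",
              max_source_tokens=400, max_target_tokens=50):
    """Concatenate source abstracts into one token sequence and truncate."""
    source_tokens = []
    for i, abstract in enumerate(source_abstracts):
        if i > 0:
            # The separator marks the end of one abstract and the start of the next.
            source_tokens.append(sep)
        source_tokens.extend(abstract.split())
    target_tokens = target_abstract.split()
    # Tokens beyond the limits are dropped, so later sources may be cut off.
    return source_tokens[:max_source_tokens], target_tokens[:max_target_tokens]

x, y = make_pair(["first cited abstract", "second cited abstract"],
                 "the target abstract", max_source_tokens=5)
print(x)
print(y)
```

Because truncation is applied after concatenation, the order of the source abstracts decides which ones survive the input limit.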

In the next subsections, we proceed by describing each system in detail.

Row 1
TextRank (extractive summarization): In this paper, we use a unique set of individual-level proposition voting data from Los Angeles County that allows us to estimate the distribution of voter preferences, including the mean, median, and variance (heterogeneity), for various subsets of voters in each of 55 State Senate, State Assembly, and U.S
BiLSTM (abstractive summarization): In this paper we examine the impact of citizen on the distribution of voter preferences in the context of the plating roll We show that there is a strong relationship between the preferences of the median and the preferences of the median We show that there is a relationship between
Gold target abstract: Contrary to popular opinion we find evidence that the views of residents of both higher and lower income neighborhoods are represented by their legislators. Analyzing the voting behavior of California state legislators on 77 proposals on which both the legislature and the public cast ballots we find first that […]

Row 2
TextRank (extractive summarization): The TIPSTER Text Summarization Evaluation (SUMMAC) has established definitively that automatic text summarization is very effective in relevance assessment tasks. 1 Introduction In May 1998, the U.S. government completed the TIPSTER Text Summarization Evaluation (SUMMAC), which was the first large-scale, developer-independent evaluation of automatic text summarization systems. 1.1 Text Summarization Text summarization
BiLSTM (abstractive summarization): In this paper we propose a new method for text summarization in the context of text summarization in the context of text summarization systems. The method is based on a set of text summarization tasks that are used in the context of text summarization tasks. The method is based on
Gold target abstract: This paper describes a framework for multidocument summarization which combines three premises: coherent themes can be identified reliably; highly representative themes running across subsets of the document collection can function as multi-document summary surrogates; and effective end-use of such themes should be facilitated by a visualization environment which clarifies […]

Table 1: Sample system outputs and expected target abstracts

4.1 TextRank

As the name implies, this system was built using the TextRank algorithm Mihalcea and Tarau (2004). Input abstracts were concatenated into a single document, separated by the token \n, and the result was the input to the TextRank algorithm. Clearly, the TextRank algorithm is not particularly suited for the GASP task because it is an extractive text summarization algorithm. This means that the words, phrases and sentences of the output abstract are selected from the source abstracts; hence this algorithm, like any extractive text summarization algorithm, is not able to capture the creative-generative process behind the authors’ intent, which is the aim of the GASP task. However, we decided to use this system as a baseline anyway, being aware of its limitations.
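To make the extractive nature of this baseline concrete, the following is a minimal, self-contained sketch of sentence-level TextRank in the spirit of Mihalcea and Tarau (2004). It is not the implementation used in the experiments: the naive sentence splitter, the word-set overlap similarity, the fixed number of iterations and the greedy word budget are simplifying assumptions.

```python
import math
import re

def split_sentences(text):
    """Naive splitter on sentence punctuation and newlines (an assumption)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+|\n+", text) if s.strip()]

def similarity(a, b):
    """Word overlap normalised by the log sizes of the two word sets."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa or not wb:
        return 0.0
    denom = math.log(len(wa)) + math.log(len(wb))
    return len(wa & wb) / denom if denom > 0 else 0.0

def textrank(text, max_words=50, damping=0.85, iterations=50):
    """Return the top-ranked sentences that fit in `max_words`, in text order."""
    sentences = split_sentences(text)
    n = len(sentences)
    if n == 0:
        return ""
    weights = [[similarity(sentences[i], sentences[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        # Weighted PageRank update over the sentence similarity graph.
        scores = [(1 - damping) + damping * sum(
                      weights[j][i] / out_sum[j] * scores[j]
                      for j in range(n) if out_sum[j] > 0)
                  for i in range(n)]
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    chosen, used = [], 0
    for i in ranked:
        length = len(sentences[i].split())
        if used + length <= max_words:  # skip sentences that exceed the budget
            chosen.append(i)
            used += length
    return " ".join(sentences[i] for i in sorted(chosen))

doc = ("Summarization systems select important sentences. "
       "Important sentences share words with many other sentences. "
       "Graph ranking scores sentences by shared words.\n"
       "The weather was pleasant yesterday.")
summary = textrank(doc, max_words=15)
print(summary)
```

Every sentence in the output is copied verbatim from the input, which is exactly why such a system cannot produce the novel part of a target abstract.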

4.2 BiLSTM

BiLSTM is a standard sequence-to-sequence system used for text summarization, trained with a maximum likelihood loss function for the sequence labeling problem. As for the TextRank system, the input sequence is composed of the input abstracts concatenated into a single document, separated by the tag [SEP] to delimit the end of an abstract and the beginning of the next one. The output sequence is simply the output abstract. The system was built with a bidirectional LSTM Hochreiter and Schmidhuber (1997) with attention Bahdanau et al. (2014) and a copy mechanism Vinyals et al. (2015) that allows it to copy input words to the output.

System    R2 Recall  R2 Precision  R2 F1  R1 Recall  R1 Precision  R1 F1  RL Recall  RL Precision  RL F1
TextRank  0.053      0.018         0.024  0.325      0.118         0.161  0.207      0.074         0.101
BiLSTM    0.019      0.058         0.027  0.109      0.334         0.154  0.077      0.238         0.108

Table 2: ROUGE performance of TextRank and BiLSTM on the test set of the GASP Corpus

5 Experiments

We performed some initial experiments to evaluate the complexity of the GASP task. Hence, we experimented with the GASP corpus and with the vanilla systems presented in the previous section. After a description of the experimental details in Sec. 5.1, we discuss results in Sec. 5.2.

5.1 Experimental Set-up

To make the experiments feasible at all, we constrained the GASP task: we cut target abstracts to 50 words. This is needed because of the computational cost of the abstractive summarization models we used. Analyzing this reduced version of the GASP corpus is still interesting as a first exploration of the complexity of the task.

In the experiments, we used the following implementations of the above vanilla systems. For the TextRank system, we used the implementation from Barrios et al. (2016). For comparison purposes and not because of computational limitations, we constrained the output of the system to abstracts of up to 50 words. This limit allows us to compare results with the other system we used. For the abstractive summarization based on BiLSTM, we used OpenNMT Klein et al. (2017) with the configuration for abstractive text summarization Gehrmann et al. (2018). The system implements a bidirectional LSTM. For computational purposes, we constrained the input to a fixed maximum number of words. This means that part of the source abstracts is not considered during training. Target, output abstracts are already constrained to 50 words. As optimizer, we used Adagrad Duchi et al. (2011) with a fixed learning rate and a fixed initial accumulator value. The batch size and the number of epochs were also fixed. To allow replicability of the experiments, we released the configuration file.

To compare systems, we used the Python implementation easy-rouge of the ROUGE Lin (2004) evaluation metric, since this measure is widely used to evaluate text summarization systems. In particular, we evaluate Recall, Precision and F1 for the metrics Rouge-1 (R1), Rouge-2 (R2) and Rouge-L (RL).
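For reference, Rouge-L can be sketched with a longest common subsequence (LCS) computation. This is an illustrative single-reference version with a balanced F1 and whitespace tokens, not the easy-rouge implementation; the function names and the toy sentences are our own.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # Extend the diagonal on a match, else keep the best neighbour.
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(generated, reference):
    """Return (recall, precision, f1) based on the LCS of the two texts."""
    lcs = lcs_length(generated, reference)
    precision = lcs / len(generated) if generated else 0.0
    recall = lcs / len(reference) if reference else 0.0
    f1 = 2 * precision * recall / (precision + recall) if lcs else 0.0
    return recall, precision, f1

generated = "the cat sat on the mat".split()
reference = "the cat lay on a mat".split()
print(lcs_length(generated, reference))
print(rouge_l(generated, reference))
```

Unlike Rouge-1 and Rouge-2, the LCS rewards words that appear in the same order in both texts without requiring them to be adjacent.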

5.2 Results and Discussion

Both the extractive and the abstractive systems produce some reasonable text for target abstracts. For example, row 1 of Table 1 shows an extractive summarization that describes the relation between preferences of voters and voting data. This is similar to what is described in the gold target abstract. Row 2, in turn, shows the abstractive summarization system BiLSTM producing an abstract related to summarization, while the gold target abstract is in fact related to multidocument summarization. From these examples, it seems that the BiLSTM system fails to get the creative-generative intent of the authors, but it is able to get the topic of target abstracts and to produce some coherent text around the topic.

In general, the extractive text summarization algorithm TextRank tends to have a higher recall and a lower precision than the other system (see Table 2). This happens because the TextRank algorithm reuses words and sentences from the source abstracts to fill the fixed 50-word length of the summary. In contrast, the abstractive BiLSTM system tends to be more precise while losing points in recall. However, some BiLSTM outputs are extremely odd. For example, mostly for medical papers, the BiLSTM system performs poorly, producing short texts like:

Clinical pulmonary a a a a a a a a a a

This is probably due to the topic distribution of papers in the training set, a problem that could be overcome with an extended version of the GASP corpus.

All the systems performed very poorly on the task. This confirms the complexity of the GASP task. We think that the reason why all the systems performed in this way is strictly related to the nature of the task: trying to learn a statistical correlation between input papers and the output paper is not enough to capture the creative intent of the authors. We speculate that solving the task requires a deep understanding of the generative process behind the writing process; hence we propose the GASP task as a way to build better machine learning models capable of grasping the authors’ intent.

6 Conclusions

To the best of our knowledge, this is the first paper that introduces the task of modeling scientific creativity. By proposing the task of generating abstracts of scientific papers from abstracts of cited papers, we opened an opportunity to build text-to-text systems attacking a task we don’t want to solve. We picked the hanging GASP dataset, which has always been in front of us, we delivered it and we started to analyze the performance of existing vanilla systems. Luckily, the timid results are still far from being satisfactory but show some encouraging directions of study. Alea iacta est. We believe GASP poses an intriguing, difficult, and philosophically important challenge for the artificial intelligence field.