Citation Text Generation

02/02/2020 ∙ by Kelvin Luu, et al. ∙ University of Washington ∙ Allen Institute for Artificial Intelligence

We introduce the task of citation text generation: given a pair of scientific documents, explain their relationship in natural language text in the manner of a citation from one text to the other. This task encourages systems to learn rich relationships between scientific texts and to express them concretely in natural language. Models for citation text generation will require robust document understanding including the capacity to quickly adapt to new vocabulary and to reason about document content. We believe this challenging direction of research will benefit high-impact applications such as automatic literature review or scientific writing assistance systems. In this paper we establish the task of citation text generation with a standard evaluation corpus and explore several baseline models.




1 Introduction

The output of the world’s scientists doubles roughly every nine years (Bornmann and Mutz, 2015), and their pace is quickening. As a result, scientists and other experts must devote significant time to the difficult task of literature review, or coming to understand the context in which they work. Might artificial intelligence help to reduce that time?

Figure 1: Overview of Citation Text Generation Task. Given a source and cited document, the goal is to write the sentence describing the specific relationship between the two. For the same source document, the output will vary depending on the cited document.

Several lines of research seek to do so. Citation recommendation systems (Valenzuela et al., 2015; Bhagavatula et al., 2018; Cohan et al., 2019) suggest references to relevant published work for a given document such as a current draft. Summarization systems (Cohan and Goharian, 2015; Yasunaga et al., 2019) condense the information in one or more documents, allowing researchers to more quickly understand the basic ideas in a piece of research.

We introduce a complementary—but so far unaddressed—problem, citation text generation, where the relationship between a document and one or several others is expressed in natural language text. This differs from traditional summarization in that the primary focus is explaining the relationship between the two documents rather than their content. Automatically describing inter-document relationships could dramatically decrease the time researchers devote to literature review. For instance, a new paper could be explained in terms of its relationships to relevant works that a particular reader is most familiar with, rather than just those which the authors elected to cite (personalization). Further, such technology could be incorporated into writing assistance systems to help less experienced or non-native writers better articulate the connection between their work and prior art. Additionally, users of citation recommendation systems can benefit from natural language explanations of recommendation system choices.

Beyond the immediate utility of citation text generation systems, the task offers significant challenges for language understanding and generation research. A major challenge is how to represent the information in one or more scientific texts. These documents are longer than those in most other domains typically studied in NLP, and make use of a long-tailed, open-domain technical vocabulary. Often an important phrase in the citing sentence output occurs only in a specific cited document and not elsewhere in the corpus. This requires a model that can learn phrase meanings from very few exposures, an important but unsolved problem for text generation systems. Possibly more challenging is understanding and expressing the various and nuanced relationships between related scientific works.

In this work, we introduce the task of citation text generation. Leveraging the full texts of English-language scientific articles, we construct a dataset of citation sentences in the computer science domain for training and evaluating citation text generation models. We investigate strong retrieval and neural baseline models against which future work can compare. For use cases where large models can be trained, we extend the successful GPT2 architecture (Radford et al., 2019) to the scientific domain with additional pre-training and subsequent fine-tuning on the citation generation task. We experiment with different kinds of document context in the fine-tuning and inference stages. We also explore retrieval-based techniques which may more easily generalize to lower-resource settings. These models retrieve citation sentences from training documents which are most similar to test inputs. Our evaluations show that these techniques often produce plausible citation sentences, but indicate clear directions for improvement. Code and artifacts are provided for future research.

2 Task

Given the important research challenges posed by the citation text generation task, along with the potential social benefits of its solutions, let us continue with a formalization of the problem. Citation text generation is the task of generating a natural language citing sentence which explains the relationship between two documents. Examples of such citing sentences can be found in scientific documents as in-text citations to a previous work. Thus, we will formally distinguish one document as the source document, from which we will draw citing sentences which reference the cited document.

If we want to leverage powerful modern neural text generation systems, we are faced with the problem of how to represent the documents in a way that these models can consume. In particular, language models like GPT2 are trained to predict next token probabilities given long stretches of contiguous text from a single document. It is not clear how to mix information from more than one document when providing context to these models.

An additional difficulty of the citation text generation task is the vocabulary. In this domain, low-frequency, highly meaningful terms regularly appear in output texts. These terms may be completely novel to a single or small collection of papers (consider the phrase “citation text generation”, for instance), yet they are necessary for explaining the paper.

This framing suggests a supervised learning setup. Let $t$ denote a citing sentence drawn from source document $S$, and let $S'$ denote $S$ without $t$. Then let

$$p(t \mid S', C, \theta) \qquad (1)$$

be the probability of $t$ given $S'$, cited document $C$, and model parameters $\theta$. The goal of learning a citation text generation model would be to maximize this probability across a large number of $(t, S, C)$ triples, so long as the parameters also generalize to unseen instances. At inference time, the goal is to generate a sentence $t^*$ which accurately describes the relationship between $S$ and $C$.
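The supervised setup above can be illustrated with a minimal sketch of how one training instance might be assembled: the citing sentence is removed from the source document and becomes the prediction target. The function name, the sentence-list representation, and the toy data are assumptions made for illustration, not part of the released code.

```python
from typing import List, Tuple


def make_training_instance(
    source_sentences: List[str],
    citing_index: int,
    cited_text: str,
) -> Tuple[str, List[str], str]:
    """Return (citing sentence, source without that sentence, cited document)."""
    citing_sentence = source_sentences[citing_index]
    # The model must not see the target inside its own context,
    # so the citing sentence is dropped from the source document.
    source_without_target = (
        source_sentences[:citing_index] + source_sentences[citing_index + 1:]
    )
    return citing_sentence, source_without_target, cited_text


# Toy data (hypothetical).
source = [
    "We study dependency parsing.",
    "Prior work (Cited) used CRFs.",
    "We use transformers instead.",
]
target, context, cited = make_training_instance(source, 1, "CRFs for parsing ...")
```

Here `target` is what the model learns to generate, while `context` and `cited` form its input.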

The most appropriate evaluation metric for most text generation tasks is human judgment by potential users of the system. Evaluating citation text requires human judges with scientific expertise. For exploratory purposes, we use the standard automatic metrics for text generation tasks described in Section 4, and we provide an expert error analysis in Section 5.1.

For source and cited documents, we use English-language computer science articles and annotation from the S2-GORC dataset (Lo et al., 2019). S2-GORC is a large citation graph dataset which includes full texts of 8.1 million scientific documents. We select a subset of 154K computer science articles as our corpus. From these, we extract 622K citing sentences that link back to other documents in our corpus. We hold out 2,500 examples for each of the validation and test sets. Detailed statistics can be found in Table 1.

                          total    average/doc.
documents                 154K     -
tokens                    813M     5.3K
unique tokens             7.1M     1.3K
citing sentences          622K     4.0
citing sentence length    -        30.3
Table 1: Dataset statistics.

3 Models

We explore two basic styles of model for citation text generation. Following current work in neural text generation, we fine-tune the predictions of a large pre-trained language model to the citation text generation task. Additionally, we investigate approximate nearest neighbor methods to retrieve plausible human-authored citation sentences from the training data.

3.1 Neural Text Generation

Recent work has shown that adapting large pre-trained language models to text generation tasks yields strong results (Zellers et al., 2019). Due to its widespread use in text generation, we investigate the GPT2 model of Radford et al. (2019) for the citation text generation task. GPT2 is a transformer model (Vaswani et al., 2017) trained on 40 gigabytes of internet text with a language modeling objective. The adaptation process, called fine-tuning, involves continued training of the model on the target objective, in our case citation text generation.

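To fine-tune GPT2 for text generation, it is typical to concatenate the conditioning context $X = x_1 \ldots x_n$ and citing sentence $t = t_1 \ldots t_m$ with a special separator token $\langle\mathrm{sep}\rangle$. The model learns to approximate next token probabilities for each index after $\langle\mathrm{sep}\rangle$:

$$p(t_{i+1} \mid X, \langle\mathrm{sep}\rangle, t_1, \ldots, t_i, \theta) \qquad (2)$$

for $0 \le i < m$ and model parameters $\theta$. Cross-entropy loss is calculated for each $t_i$, and backpropagation is used to find parameters $\theta$ which maximize $p(t_{i+1} \mid X, \langle\mathrm{sep}\rangle, t_1, \ldots, t_i, \theta)$.

To adapt Equation 2 to the citation text generation task, we construct the conditioning context $X$ from the source and cited documents. We take $j$ tokens $s_1, \ldots, s_j$ from the source document along with $k$ tokens $c_1, \ldots, c_k$ from the cited document. (Which tokens are drawn from the two documents is an independent variable that we explore experimentally.) We then condition the generation of citing sentence $t$ on $X = s_1, \ldots, s_j, c_1, \ldots, c_k$. This model is trained to predict each token of $t$ as described above.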
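As a rough illustration of the input construction described above, the sketch below concatenates a slice of source tokens, a slice of cited tokens, a separator, and the citing sentence, and builds a mask marking the positions after the separator where the loss would apply. The separator string `<sep>`, the choice of taking the first tokens of each document, and whitespace tokenization are all assumptions for this example; the actual model would use GPT2's subword vocabulary.

```python
from typing import List, Tuple

SEP = "<sep>"  # hypothetical separator token; the text does not name it


def build_training_sequence(
    source_tokens: List[str],
    cited_tokens: List[str],
    citing_tokens: List[str],
    j: int,
    k: int,
) -> Tuple[List[str], List[int]]:
    """Concatenate context, separator, and citing sentence.

    Returns the token sequence and a 0/1 mask that is 1 only on the
    citing-sentence tokens, i.e. the positions after the separator.
    """
    # Assumption: use the FIRST j source tokens and the FIRST k cited tokens.
    context = source_tokens[:j] + cited_tokens[:k]
    sequence = context + [SEP] + citing_tokens
    loss_mask = [0] * (len(context) + 1) + [1] * len(citing_tokens)
    return sequence, loss_mask


seq, mask = build_training_sequence(
    source_tokens="we propose a new parser".split(),
    cited_tokens="crfs label sequences well".split(),
    citing_tokens="we build on crfs (Cited) .".split(),
    j=3,
    k=2,
)
```

Masking the context positions is one common way to restrict the loss to the target sentence; whether the original implementation masks or simply slices the logits is not stated in the text.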

Source  Cited      BLEU    Rouge-1  Rouge-2  Rouge-L  SciBERTScore
abs     abs        8.98    0.141    0.011    0.101    0.668
abs     intro      9.01    0.142    0.010    0.100    0.667
abs     sample     8.95    0.124    0.009    0.089    0.667
intro   abs        9.65    0.124    0.015    0.093    0.658
intro   intro      9.68    0.123    0.015    0.093    0.657
intro   sample     9.51    0.120    0.013    0.091    0.653
==================================================================
IR                 11.02   0.162    0.023    0.115    0.656
IR (BLEU)          10.92   0.161    0.022    0.114    0.655
IR (SciBERTScore)  10.88   0.120    0.015    0.090    0.654
Table 2: Automatic evaluation of generated texts. Differences between entries in bold are statistically significant.


The primary question we investigate with this model is what kind of input is best for generating accurate and informative citation sentences. Prior works in citation recommendation have made use of abstracts, which perhaps act as sufficient summaries of document content for this task. Additionally, we explore variants of extended context, such as the introduction or first section after the abstract. Since scientific texts are too long to fit into the context window of our generation model, we also investigate a “sampling” approach which samples sentences from throughout the document until the context window is full. In this work, we combine either the abstract or introduction of the source document with each of the abstract, introduction, or sampled sentences from the cited document.
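The “sampling” context could be sketched as follows: sentences are drawn at random from the whole document and kept while they still fit a token budget. Uniform sampling without replacement, whitespace token counts, and restoring document order at the end are assumptions for this illustration; the text only says that sentences are sampled until the context window is full.

```python
import random
from typing import List


def sample_context(sentences: List[str], max_tokens: int, seed: int = 0) -> List[str]:
    """Pick random sentences that together fit within max_tokens."""
    rng = random.Random(seed)
    order = list(range(len(sentences)))
    rng.shuffle(order)  # visit sentences in random order

    chosen: List[int] = []
    used = 0
    for idx in order:
        n_tokens = len(sentences[idx].split())  # crude whitespace token count
        if used + n_tokens > max_tokens:
            continue  # this sentence does not fit; try the next one
        chosen.append(idx)
        used += n_tokens

    # Assumption: present the sampled sentences in their original order.
    return [sentences[i] for i in sorted(chosen)]


document = ["a b c", "d e", "f g h i", "j"]
sampled = sample_context(document, max_tokens=5)
```

With a large enough budget the function simply returns the whole document, so the same code path covers short and long papers.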

3.2 Retrieval with Approximate Nearest Neighbors

While neural text generation techniques have advanced significantly in recent years, they are still inferior to human-authored texts. For some tasks, it is better to retrieve a relevant human-authored text rather than generating novel text automatically (Fan et al., 2018). Is this also the case for citation text generation?

To answer this question, we adapt an approximate nearest neighbor search algorithm to find similar pairs of documents. The basic search procedure is as follows: Given a test instance input $(S, C)$ for source document $S$ and cited document $C$, we find the set $\mathbf{N}_C$, the nearest neighbors to $C$ in the training data. For each document $N_C$ from $\mathbf{N}_C$, let $\mathbf{N}_S$ be the set of documents that cite $N_C$. This means that each $N_S \in \mathbf{N}_S$ contains at least one citing sentence $t'$ which cites $N_C$. We return the $t'$ associated with the $(N_S, N_C)$ pair from the training data which is closest to $(S, C)$.

We measure the closeness of two pairs of documents by measuring cosine distances between vector representations of their content. The abstract of each document is embedded into a single dense vector by averaging the contextualized embeddings provided by the SciBERT model of Beltagy et al. (2019) and normalizing. The distance between $(S, C)$ and candidate $(N_S, N_C)$ is computed as:

$$\alpha \cos(S, N_S) + \beta \cos(C, N_C) \qquad (3)$$

where $\cos(\cdot, \cdot)$ is computed between the embedded abstracts, and $\alpha$ and $\beta$ control the relative contribution of the two document similarities. We explore setting both $\alpha$ and $\beta$ to 1, or tuning them to optimize either BLEU or BERTScore on the validation set.
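The retrieval procedure can be sketched in a few lines. The example below mean-pools token vectors into a unit-length document vector, shortlists training pairs whose cited document is nearest to the query's cited document, and then returns the citing sentence of the pair with the smallest weighted distance. It uses exhaustive search in place of an approximate nearest neighbor index, plain Python lists in place of SciBERT embeddings, and hypothetical names (`retrieve`, `top_k`, `alpha`, `beta`), so it should be read as an illustration under those assumptions.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]
TrainingPair = Tuple[Vector, Vector, str]  # (source vec, cited vec, citing sentence)


def mean_pool(token_vectors: Sequence[Sequence[float]]) -> Vector:
    """Average token vectors and normalize the result to unit length."""
    dim = len(token_vectors[0])
    mean = [sum(v[d] for v in token_vectors) / len(token_vectors) for d in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean)) or 1.0
    return [x / norm for x in mean]


def cosine_distance(a: Vector, b: Vector) -> float:
    """Cosine distance between two unit-length vectors."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def pair_distance(src: Vector, cited: Vector, cand_src: Vector, cand_cited: Vector,
                  alpha: float = 1.0, beta: float = 1.0) -> float:
    """Weighted sum of source-side and cited-side distances."""
    return alpha * cosine_distance(src, cand_src) + beta * cosine_distance(cited, cand_cited)


def retrieve(src: Vector, cited: Vector, training: List[TrainingPair],
             top_k: int = 2, alpha: float = 1.0, beta: float = 1.0) -> str:
    # Stage 1: shortlist pairs whose cited document is closest to the query's.
    shortlist = sorted(training, key=lambda ex: cosine_distance(cited, ex[1]))[:top_k]
    # Stage 2: among those, pick the pair closest under the weighted distance.
    best = min(shortlist, key=lambda ex: pair_distance(src, cited, ex[0], ex[1], alpha, beta))
    return best[2]


training_pairs: List[TrainingPair] = [
    ([1.0, 0.0], [0.0, 1.0], "sentence a"),
    ([0.0, 1.0], [0.0, 1.0], "sentence b"),
    ([1.0, 0.0], [1.0, 0.0], "sentence c"),
]
query_source = mean_pool([[2.0, 0.0], [0.0, 0.0]])
query_cited = mean_pool([[0.0, 3.0]])
result = retrieve(query_source, query_cited, training_pairs)
```

In a real system the stage-1 shortlist would come from an approximate nearest neighbor index, which avoids scanning the entire training set for every query.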

3.3 Language Model Pretraining

GPT2-based models have demonstrated an ability to capture long distance dependencies over hundreds of tokens, which we hypothesize will allow them to synthesize information in both the source and cited documents. But citation text generation models must also handle the challenging technical vocabulary of the scientific domain.

Prior work has shown that pretraining on in-domain data improves the performance of large language models on domain-specific tasks (Beltagy et al., 2019). Inspired by this, we experiment with additional pretraining of GPT2 in the science domain. This model, SciGPT2, is trained for an additional 3 epochs over the full text of the documents in our corpus using a language modeling objective. We note that both SciGPT2 and the SciBERT language models used here have been exposed to citing sentences from the test and validation sets as in-line citations during their pre-training phases, which may improve their performance versus models without this exposure. Such exposure is typical when using pretrained language models, as text from test data cannot be guaranteed to be absent from the large task-independent corpora upon which these models are trained.

4 Evaluation

We compare the different baseline systems using BLEU (Papineni et al., 2002), ROUGE (specifically ROUGE-1, ROUGE-2, and ROUGE-L; Lin, 2004), and the recently introduced BERTScore (Zhang et al., 2019), a similarity metric based on BERT embeddings which has been shown to correlate well with human judgements on other tasks. To adapt the BERTScore metric to the scientific text domain, we use SciBERT embeddings.
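To give a feel for what the overlap-based metrics measure, here is a minimal unigram-overlap F1 in the spirit of ROUGE-1. It is a simplified stand-in, not the official ROUGE implementation: lowercasing, whitespace tokenization, and reporting F1 rather than recall are assumptions of this sketch.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """F1 over clipped unigram matches between candidate and reference."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


score = unigram_f1("the cat sat", "the cat ran")
```

A generic sentence that shares only common words with the reference can still score above zero on such a metric, which is one reason overlap scores are complemented here by an embedding-based metric and by human analysis.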

Table 2 (above the double line) shows the performance of the SciGPT2 model on the test set when provided with the different input context combinations outlined in Section 3.1. We find that context does make a difference for this category of model, and that models which have access to the intro of the documents outperform those which use abstracts or sampling.

Automatic evaluation results for the retrieval-based methods on the test data are shown below the double line in Table 2. This table shows that the retrieval methods perform well on this task. However, we will show the limitations of these automatic metrics in Section 5.1. We also observe that tuning the $\alpha$ and $\beta$ weights of the retrieval distance on the validation set results in overfitting for this method. Outputs are largely unchanged by this tuning; fewer than 400 test datapoints differ from the untuned outputs. A larger validation split may alleviate this problem.

Statistical significance is assessed for select results using bootstrapping with 1000 samples in each of 100 iterations. This test shows that conditioning on the introduction of the source document improves performance compared to conditioning on the abstract when using the SciGPT2 model. However, we see that IR methods perform better than the best neural models. (Significance in both cases is assessed after Bonferroni correction.)

We do not find enough evidence to reject the null hypothesis regarding what context from the cited document should be used.
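A generic paired bootstrap test with a Bonferroni adjustment is sketched below to make the significance procedure concrete. It is not necessarily the authors' exact protocol (the text mentions 1000 samples in each of 100 iterations); the per-example score lists, function names, and one-sided formulation are assumptions.

```python
import random
from typing import List


def paired_bootstrap_p(scores_a: List[float], scores_b: List[float],
                       n_resamples: int = 1000, seed: int = 0) -> float:
    """Fraction of resamples in which system A fails to beat system B."""
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must be aligned per example")
    rng = random.Random(seed)
    n = len(scores_a)
    not_better = 0
    for _ in range(n_resamples):
        # Resample example indices with replacement, keeping pairs aligned.
        indices = [rng.randrange(n) for _ in range(n)]
        diff = sum(scores_a[i] - scores_b[i] for i in indices)
        if diff <= 0:
            not_better += 1
    return not_better / n_resamples


def bonferroni(p_value: float, n_comparisons: int) -> float:
    """Adjust a p-value for multiple comparisons, capped at 1."""
    return min(1.0, p_value * n_comparisons)


always_better = paired_bootstrap_p([1.0] * 20, [0.0] * 20)
never_better = paired_bootstrap_p([0.0] * 20, [1.0] * 20)
```

The Bonferroni step multiplies each p-value by the number of comparisons made, so a result must clear a stricter threshold when several hypotheses are tested at once.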

                      SciGPT2 (abs abs)              IR (untuned)
                      One Visible   Both Visible     One Visible   Both Visible
Source
  Believable          32            22               37            23
  Content-Dependent   12            8                5             1
  Not Believable      6             20               8             26
Cited
  Believable          24            19               17            16
  Content-Dependent   8             6                2             0
  Not Believable      18            25               31            34
Table 3: Error analysis of SciGPT2 and IR generated texts.

5 Analysis

In this section we take a closer look at the details of the SciGPT2 and IR system outputs on a collection of validation datapoints. We provide a quantitative error analysis as well as qualitative analysis and examples.

5.1 Errors

In order to better understand the performance of the models, we undertake a quantitative analysis of their output. One author randomly selected 200 datapoints from the validation set and their associated model outputs. Source and cited papers on the topic of NLP were used so as to facilitate expert judgement. For tractability, we limited the context presented to the annotator to the document abstracts and analyzed the outputs of the abs abs and IR systems.

In this analysis, we ask whether the models are producing believable citing sentences given their input. In particular, we are interested in the relative believability of the SciGPT2 and IR systems, as well as how believability of a citing sentence changes when a reader can see the abstract of one document or both.

We use 100 datapoints with outputs from the SciGPT2 system and 100 with outputs from the IR system. For 50 datapoints from each system, the cited document’s abstract is initially masked such that only the source context is visible (Source, One Visible). Based only on the source context, the annotator judged whether the model output (1) could have convincingly been a citation in the source document based solely on the abstract (believable), (2) could have been a citation in the source document, but unclear from the abstract alone and depends on the rest of the paper’s content (content-dependent), or (3) is unlikely to appear in this document (not believable). After making this judgment, the annotator was then shown the abstract of the cited document and asked to make the 3-way believability judgment based on both source and cited abstracts (Source, Both Visible). This process is repeated with the remaining 50 datapoints, but with the cited context masked initially (Cited, One Visible and Cited, Both Visible).

The results of our analysis are presented in Table 3. We find that believability in the Cited, One Visible condition correlates well with the Cited, Both Visible condition. In the Source conditions, we see a greater difference in believability between One Visible and Both Visible. These findings make sense: in-line citations often summarize a prior study rather than highlight the paper’s own contributions. Together, these results indicate that the believability of citing sentences is more related to the cited document than to the source.

Another interesting feature of this analysis is the difference between SciGPT2 and IR in terms of content-dependent citing sentences. We observe fewer such judgements in the IR outputs. This is probably due to the fact that neural text generation systems such as SciGPT2 will sometimes produce generic, uninformative outputs while the IR system outputs are usually specific enough that a stronger believability judgement can be made.

We also observe an overall higher incidence of not believable judgements of the IR model outputs. This implies that automatic metrics such as BLEU, where the IR system scored higher than SciGPT2, do not correlate with factual accuracy in citation text generation.

Example citations and annotations are shown in Table 4. We find that in the cases where the model-generated outputs are unconvincing, they are still on topic. All 10 cases in the Source, One Visible and 4 of the cases in Cited, One Visible that were no longer believable in the Both Visible conditions exhibit this quality. A common example (4 cases) of this phenomenon occurs when the model output references a dataset. While the dataset would be potentially relevant to both papers, the cited papers focus on modeling contributions and do not introduce a novel corpus.

Example 1: Source (Visible)
Source abstract: This paper investigates the interplay between different types of user interactions on Twitter, with respect to predicting missing or unseen interactions … Interestingly, the most predictive features vary with the user profiles, and are not the same across all users. For example, for a pair of users that interact with a large number of other Twitter users, we find that certain “higher-dimensional” triads, i.e., triads that involve multiple types of interactions, are very informative, whereas for less active Twitter users, certain in-degrees and out-degrees play a major role. …
Cited abstract: We study online social networks in which relationships can be either positive (indicating relations such as friendship) or negative (indicating relations such as opposition or antagonism). Such a mix of positive and negative links arise in a variety of online settings …
Model output: (Cited) analyzed tweets as graph streams for predicting friendship relationships, although they focused on friendship relationships, not triads.
One Visible: Believable
Both Visible: Believable; citation distinguishes that the source document deals with triads while the cited document does not

Example 2: Cited (Visible)
Source abstract: We describe an approach for unsupervised learning of a generic, distributed sentence encoder. Using the continuity of text from books, we train an encoder-decoder model that tries to reconstruct the surrounding sentences of an encoded passage … After training our model, we extract and evaluate our vectors with linear models on 8 tasks: semantic relatedness, paraphrase detection, image-sentence ranking, question-type classification and 4 benchmark sentiment and subjectivity datasets …
Cited abstract: Variants of Naive Bayes (NB) and Support Vector Machines (SVM) are often used as baseline methods for text classification, where words are used as feature for training a classifier. This generally involves a huge number of features. This generally involves a huge number of features. Some techniques, such as Latent Semantic Analysis (LSA) …
Model output: To test this, we used the task of Sentiment Analysis and Sentiment Analysis by (Cited) as a benchmark, and the work presented in this paper as an additional testbed to evaluate the effectiveness of word vectors on other tasks.
One Visible: Believable
Both Visible: Not Believable; while repetitive, the cited document does perform sentiment analysis. The source document, however, does not.

Example 3: Source (Visible)
Source abstract: Recent models of emotion recognition strongly rely on supervised deep learning solutions for the distinction of general emotion expressions. However, they are not reliable when recognizing online and personalized facial expressions, e.g., for person-specific affective understanding. In this paper, we present a neural model based on a conditional adversarial autoencoder
Cited abstract: The continuous dimensional emotion modelled by arousal and valence can depict complex changes of emotions. In this paper, we present our works on arousal and valence predictions for One-Minute-Gradual (OMG) Emotion Challenge. Multimodal representations are first extracted from videos using a variety of acoustic, video and textual models …
Model output: This dataset contains 85,110 image-class videos and their respective emotion labels (Cited).
One Visible: Believable
Both Visible: Not Believable; cited paper seems to be using the dataset from a challenge rather than introducing a dataset.

Example 4: Source (Visible)
Source abstract: The bag-of-words (BOW) model is the common approach for classifying documents, where words are used as feature for training a classifier, but their performance varies greatly depending on the model variant, features used and task/dataset. …
Cited abstract: Variants of Naive Bayes (NB) and Support Vector Machines (SVM) are often used as baseline methods for text classification, where words are used as feature for training a classifier. This generally involves a huge number of features. This generally involves a huge number of features. Some techniques, such as Latent Semantic Analysis (LSA) …
Model output: We use the subset of these datasets from (Cited).
One Visible: Believable
Both Visible: Not Believable; the cited paper seems to be an analysis paper and does not introduce any novel datasets on sentiment analysis.

Table 4: Curated examples from the annotator’s error analysis.

5.2 Examples

Example system outputs for randomly selected validation instances are shown in Table 5. We see that both the SciGPT2 and IR model outputs regularly hit on the correct broad topic of the cited text (such as “literary analysis” or “image captioning evaluation metrics”). It is notable that the SciGPT2 model outputs syntactically correct and coherent citation sentences, even given the difficulty of the vocabulary in this domain. This is a testament to the power of the domain-specific language model training.

We also observe that the outputs of the SciGPT2 model are often shorter than the desired citing sentence. Brevity is a known issue for neural text generation and may be alleviated by penalizing brevity in the inference procedure. More problematic are the factual errors in the generated text. In the last example, for instance, we see that SciGPT2 fails to cite the specific image captioning dataset described in the cited paper (Pascal1K) and instead focuses on the more general evaluation metric for the image captioning task (CIDEr). This is typical of neural text generation systems, which often assign high probability to generic or frequent phrases and revert to these in the face of uncertainty.
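One common way to penalize brevity at inference time is to rescore candidate generations by length-normalized log-probability. The sketch below shows that idea only; it is not something the experiments above used, and the function names, the exponent `alpha`, and the toy log-probabilities are hypothetical.

```python
from typing import List, Tuple


def length_normalized_score(log_prob: float, length: int, alpha: float = 1.0) -> float:
    """Divide the total log-probability by length**alpha (alpha=0 disables it)."""
    return log_prob / (length ** alpha)


def pick_candidate(candidates: List[Tuple[str, float]], alpha: float = 1.0) -> str:
    """Return the candidate text with the best length-normalized score."""
    def score(item: Tuple[str, float]) -> float:
        text, log_prob = item
        n_tokens = max(len(text.split()), 1)  # whitespace tokens, at least 1
        return length_normalized_score(log_prob, n_tokens, alpha)

    return max(candidates, key=score)[0]


# Toy candidates with made-up total log-probabilities.
candidates = [
    ("we use (Cited) .", -2.0),
    ("we use the dataset of (Cited) for evaluation .", -3.0),
]
raw_choice = pick_candidate(candidates, alpha=0.0)
normalized_choice = pick_candidate(candidates, alpha=1.0)
```

Without normalization the shorter candidate wins because it accumulates fewer negative log-probability terms; with normalization the longer, more specific candidate is preferred in this toy case.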

5.3 Future Work

The fluency and topical relevance of the baseline models show the plausibility of the citation text generation task as well as the utility of including pretrained scientific language models in future models. But based on the kinds of errors we have seen, future work should focus on two complementary goals: ensuring the factual accuracy of the generated text and improved modeling of the cited document. Factual accuracy is difficult to enforce in statistical text generation systems, especially where inference includes sampling procedures. Grounding to knowledge bases could help. For this task, knowledge extracted from candidate generations could be compared with knowledge from the full source and cited documents to prune false or irrelevant statements. Further, modeling input documents as knowledge graphs of their contents may help these algorithms better understand the cited document, resulting in better outputs. However, such a model will have to address the open problem of combining pretrained language models with graph encoding techniques.

Example 1
Source abstract: Learning to classify unseen class samples at test time is popularly referred to as zero-shot learning (ZSL). If test samples can be from training (seen) as well as unseen classes, it is a more challenging problem …
Cited abstract: State-of-the-art methods for zero-shot visual recognition formulate learning as a joint embedding problem of images and side information. In these formulations the current best complement to visual features are attributes …
System output: Nevertheless, our model is able to obtain competitive results with (Cited).
Citing sentence: For CUB dataset, we use CNN-RNN textual features (Cited) as class attributes, similar to the approaches mentioned in Table 5 and 2.

Example 2
Source abstract: Secure communication over a wiretap channel is considered in the disadvantaged wireless environment, where the eavesdropper channel is (possibly much) better than the main channel. …
Cited abstract: We consider the secure transmission of information over an ergodic fading channel in the presence of an eavesdropper. Our eavesdropper can be viewed as the wireless counterpart of Wyner’s wiretapper. …
System output: In (Cited), an optimal SWIPT scheme was proposed with perfect CSIT.
System output: Consider the channel model shown in Figure 1, which reflects the understanding that in an adversarial game in modern communication systems, it is the interference effects on wideband receiver front-ends rather than the baseband processing that is the significant detriment (Cited).
Citing sentence: However, public discussion schemes result in low secrecy rates in scenarios of interest (as discussed in detail in (Cited)), and the technique proposed here can be used in conjunction with public discussion approaches when two-way communication is possible. In this work, we exploit current hardware limitations of the eavesdropper to achieve everlasting security.

Example 3
Source abstract: Indian epics have not been analyzed computationally to the extent that Greek epics have. In this paper, we show how interesting insights can be derived from the ancient epic Mahabharata by applying a variety of analytical techniques based on a combination of natural language processing, sentiment/emotion analysis and social network analysis methods. …
Cited abstract: We present a method for extracting social networks from literature, namely, nineteenth-century British novels and serials. We derive the networks from dialogue interactions, and thus our method depends on the ability to determine when two characters are in conversation. …
System output: The authors of (Cited) presented a method of characterizing the motivations for writing the essays by examining the topical influence of characters
System output: We present an approach to the extraction of family relations from literary narrative, which incorporates a technique for utterance attribution proposed recently by (Cited).
Citing sentence: Robert (Cited) defined the eight basic emotion types.

Example 4
Source abstract: Automatic description generation from natural images is a challenging problem that has recently received a large amount of interest from the computer vision and natural language processing communities. In this survey, …
Cited abstract: Crowd-sourcing approaches such as Amazon’s Mechanical Turk (MTurk) make it possible to annotate or collect large amounts of linguistic data at a relatively low cost and high speed. However, MTurk offers only limited control over who is allowed to participate in a particular task. …
System output: Evaluation was performed using the CIDEr metric (Cited).
System output: The last and the most challenging dataset, Pascal1k (Cited), is a collection of images with associated natural language sentences.
Citing sentence: The Pascal1K sentence dataset (Cited) is a dataset which is commonly used as a benchmark for evaluating the quality of description generation systems.

Table 5: Randomly selected examples of system inputs and outputs from validation set.

6 Related Work

The current work builds on recent research in scientific document understanding, including citation recommendation and categorization, as well as scientific document summarization.

Citation recommendation, or the task of selecting works related to a source document which would be suitable for citing, is a longstanding goal of AI research (McNee et al., 2002; Bhagavatula et al., 2018; Nallapati et al., 2008). Recently, researchers have sought to categorize citations using various ontologies of citation intents. Valenzuela et al. (2015) sought to discern “highly influential” citations from others. Jurgens et al. (2016) use six categories including “motivation”, “uses”, and “future work” among others. Cohan et al. (2019) condense this ontology to just three: “background”, “method”, and “result comparison”.

We view the citation text generation task as an extension of these classification approaches with distinct advantages. While classification requires an extant citation link to exist, our generation task can describe possible relationships between works which do not cite each other, such as contemporaneous works. Additionally, because gold citation texts are readily available in scientific documents, the citation text generation task requires no task-specific annotated training data. In practice, citation classification is used to assist in suggesting relevant works to researchers; citation text generation complements this goal by providing rationales for the recommendation and furthering progress toward explainable AI.

Generating a citation is also connected to summarizing scientific documents. There is a long history of research on summarizing scientific documents (Luhn, 1958; Paice, 1980). More recently, researchers have included citing sentences as part of the input for summarization, hoping to capture the contribution of a work along with its content (Nakov et al., 2004; Cohan and Goharian, 2017; Yasunaga et al., 2019). Ours is the first to focus on the specific relationship between two documents when generating such sentences. Because of the emphasis on relational document understanding in our task, citation generation models can be used to assist with drafting papers as well, reducing researcher workload and providing non-native writers with a helpful first draft.

Our work builds on recent advances in transfer learning in NLP. In particular, large pretrained models such as BERT (Devlin et al., 2018) and GPT2 (Radford et al., 2019) have made strong advances on a number of tasks (Wang et al., 2019). It has also been shown that pretraining these models on domain-specific data further improves results on domain-specific tasks (Beltagy et al., 2019; Lee et al., 2019). In this work, we apply that methodology by adding a pretraining phase on in-domain data before finetuning a GPT2 model on the citation text generation task.
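The ordering of training phases described here (general pretraining, then in-domain pretraining, then task finetuning) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: a toy add-one-smoothed bigram model stands in for GPT2, the three tiny corpora are invented, and the names `BigramLM`, `tokenize`, and `held_out` are hypothetical. Because bigram counts are order-independent, the sketch shows only how data flows through the phases, not the optimization behavior of a neural model.

```python
import math
from collections import Counter, defaultdict


def tokenize(sentence):
    """Lowercase, split on whitespace, and add sentence boundary markers."""
    return ["<s>"] + sentence.lower().split() + ["</s>"]


class BigramLM:
    """Toy add-one-smoothed bigram model; a stand-in for GPT2, not GPT2."""

    def __init__(self, vocab):
        self.vocab_size = len(vocab)        # fixed up front, like a tokenizer
        self.counts = defaultdict(Counter)  # counts[prev][next]

    def train(self, sentences):
        """Add counts in place, so each call continues from the prior phase."""
        for sentence in sentences:
            tokens = tokenize(sentence)
            for prev, nxt in zip(tokens, tokens[1:]):
                self.counts[prev][nxt] += 1

    def loss(self, sentence):
        """Average negative log-likelihood per bigram (lower is better)."""
        tokens = tokenize(sentence)
        total = 0.0
        for prev, nxt in zip(tokens, tokens[1:]):
            row = self.counts.get(prev, Counter())
            prob = (row[nxt] + 1) / (sum(row.values()) + self.vocab_size)
            total -= math.log(prob)
        return total / (len(tokens) - 1)


# Invented toy corpora (assumptions, not the paper's data).
general_corpus = ["the cat sat on the mat", "the dog chased the ball"]
in_domain_corpus = [
    "we train a neural model on scientific text",
    "the model uses attention over the source document",
]
task_corpus = [
    "we build on the neural model of prior work",
    "our approach extends the attention model",
]

vocab = {
    token
    for corpus in (general_corpus, in_domain_corpus, task_corpus)
    for sentence in corpus
    for token in tokenize(sentence)
}
held_out = "we build on the attention model"  # hypothetical citation-like text

model = BigramLM(vocab)

model.train(general_corpus)      # phase 1: general pretraining
loss_general = model.loss(held_out)

model.train(in_domain_corpus)    # phase 2: additional in-domain pretraining
loss_in_domain = model.loss(held_out)

model.train(task_corpus)         # phase 3: finetuning on the target task
loss_task = model.loss(held_out)

print(loss_general, loss_in_domain, loss_task)
```

On this toy data the held-out loss falls after each phase, which reflects only how the example corpora were chosen and says nothing about the results reported in this paper.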

7 Conclusion

We have introduced the challenging but useful task of citation text generation. This task requires reasoning about the relationships between documents and expressing these relationships in natural language text. We have established a dataset for this task and studied the performance of contemporary neural text generation and information retrieval models. Our analysis shows that while these models produce fluent and topical outputs, more research is needed to ensure factual accuracy and specificity in the generated text.


This research was supported by the Office of Naval Research under the MURI grant N00014-18-1-2670.


  • I. Beltagy, K. Lo, and A. Cohan (2019) SciBERT: pretrained language model for scientific text. In EMNLP. arXiv:1903.10676.
  • C. Bhagavatula, S. Feldman, R. Power, and W. Ammar (2018) Content-based citation recommendation. In NAACL-HLT.
  • L. Bornmann and R. Mutz (2015) Growth rates of modern science: a bibliometric analysis based on the number of publications and cited references. Journal of the Association for Information Science and Technology 66 (11), pp. 2215–2222. ISSN 2330-1635.
  • A. Cohan, W. Ammar, M. van Zuylen, and F. Cady (2019) Structural scaffolds for citation intent classification in scientific publications. In NAACL-HLT.
  • A. Cohan and N. Goharian (2015) Scientific article summarization using citation-context and article’s discourse structure. In EMNLP.
  • A. Cohan and N. Goharian (2017) Contextualizing citations for scientific summarization using word embeddings and domain knowledge. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1133–1136.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805.
  • A. Fan, M. Lewis, and Y. Dauphin (2018) Hierarchical neural story generation. In ACL.
  • D. Jurgens, S. Kumar, R. Hoover, D. A. McFarland, and D. Jurafsky (2016) Citation classification for behavioral analysis of a scientific field. arXiv:1609.00435.
  • J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, and J. Kang (2019) BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics.
  • C. Lin (2004) ROUGE: a package for automatic evaluation of summaries. In ACL 2004.
  • K. Lo, L. L. Wang, M. Neumann, R. Kinney, and D. S. Weld (2019) GORC: a large contextual citation graph of academic papers. arXiv:1911.02782.
  • H. P. Luhn (1958) The automatic creation of literature abstracts. IBM Journal of Research and Development 2, pp. 159–165.
  • S. M. McNee, I. Albert, D. Cosley, P. Gopalkrishnan, S. K. Lam, A. M. Rashid, J. A. Konstan, and J. Riedl (2002) On the recommending of citations for research papers. In CSCW.
  • P. I. Nakov, A. S. Schwartz, and M. Hearst (2004) Citances: citation sentences for semantic analysis of bioscience text. In Proceedings of the SIGIR, Vol. 4, pp. 81–88.
  • R. Nallapati, A. Ahmed, E. P. Xing, and W. W. Cohen (2008) Joint latent topic models for text and citations. In KDD.
  • C. D. Paice (1980) The automatic generation of literature abstracts: an approach based on the identification of self-indicating phrases. In SIGIR.
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pp. 311–318.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. OpenAI Blog 1 (8).
  • M. Valenzuela, V. A. Ha, and O. Etzioni (2015) Identifying meaningful citations. In AAAI Workshop: Scholarly Big Data.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
  • A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman (2019) SuperGLUE: a stickier benchmark for general-purpose language understanding systems. arXiv:1905.00537.
  • M. Yasunaga, J. Kasai, R. Zhang, A. R. Fabbri, I. Li, D. Friedman, and D. R. Radev (2019) ScisummNet: a large annotated corpus and content-impact models for scientific paper summarization with citation networks. In AAAI.
  • R. Zellers, A. Holtzman, H. Rashkin, Y. Bisk, A. Farhadi, F. Roesner, and Y. Choi (2019) Defending against neural fake news. arXiv:1905.12616.
  • T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi (2019) BERTScore: evaluating text generation with BERT. arXiv:1904.09675.