Keyphrase extraction is the process of selecting phrases that capture the most salient topics in a document [Turney2002]. Keyphrases serve as an important piece of document metadata, often used in downstream tasks including information retrieval, document categorization, clustering and summarization. Keyphrase extraction has been applied to many different types of documents such as scientific papers [Kim et al.2010], news articles [Hulth and Megyesi2006], Web pages [Yih, Goodman, and Carvalho2006], and meeting transcripts [Liu et al.2009].
Classic techniques for keyphrase extraction involve a two-stage approach [Hasan and Ng2014]: (1) candidate generation, and (2) pruning. During the first stage, the document is processed to extract a set of candidate keyphrases. In the second stage, this candidate set is pruned by selecting the most salient candidate keyphrases, using either supervised or unsupervised techniques. In the supervised setting, pruning is formulated as a binary classification problem: determine whether a given candidate is a keyphrase. In the unsupervised setting, pruning is treated as a ranking problem, where the candidates are ranked based on some measure of importance and those below a particular threshold are discarded.
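The two stages can be illustrated with a minimal sketch. It is not one of the systems cited above: the stop word list, the n-gram candidate rule, and the frequency-times-length ranking score are all hypothetical choices made only to show where candidate generation ends and pruning begins.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop word list (illustration only).
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "for", "and", "or",
              "is", "are", "to", "as", "by", "with"}


def generate_candidates(text, max_len=3):
    """Stage 1: collect word n-grams that neither start nor end with a stop word."""
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
    candidates = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if gram[0] in STOP_WORDS or gram[-1] in STOP_WORDS:
                continue
            candidates[" ".join(gram)] += 1
    return candidates


def prune_candidates(candidates, top_k=5):
    """Stage 2 (unsupervised): rank by frequency times phrase length, keep top_k."""
    scored = {phrase: count * len(phrase.split())
              for phrase, count in candidates.items()}
    ranked = sorted(scored, key=lambda phrase: (-scored[phrase], phrase))
    return ranked[:top_k]


text = ("Keyphrase extraction selects phrases. "
        "Keyphrase extraction is useful for retrieval.")
print(prune_candidates(generate_candidates(text), top_k=3))
```

Even in this toy version, the result depends on the stop word list, the maximum n-gram length, and the cut-off, which is the kind of sensitivity to heuristics and parameters that makes such pipelines hard to reproduce.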
Challenges - Researchers typically employ a combination of different techniques for generating candidate keyphrases, such as extracting named entities, finding noun phrases that adhere to pre-defined lexical patterns [Barker and Cornacchia2000], or extracting n-grams that appear in an external knowledge base like Wikipedia [Grineva, Grinev, and Lizorkin2009]. The candidates are further cleaned up using stop word lists or gazetteers. Errors in any of these techniques reduce the quality of candidate keyphrases. For example, if a named entity is not identified as such, it is never considered as a keyphrase; if there are errors in part-of-speech tagging, extracted noun phrases might be incomplete. Also, since candidate generation involves a combination of heuristics with specific parameters, thresholds, and external resources, it is hard to reproduce any particular result or migrate implementations to new domains.
Motivation - Recently, researchers have started to approach keyphrase extraction as a sequence labeling task, where each token in the document is tagged as either being a part of a keyphrase or not [Gollapalli and Li2016, Alzaidy, Caragea, and Giles2019]. There are many advantages to this new formulation. First, it completely bypasses the candidate generation stage and provides a unified approach to keyphrase extraction. Second, unlike binary classification where each keyphrase is classified independently, sequence labeling finds an optimal assignment of keyphrase labels for the entire document. Lastly, sequence labeling makes it possible to capture long-term semantic dependencies in the document, which are known to be prevalent in natural language.
More recently, there have been significant advances in deep contextual language models such as ELMo [Peters et al.2018], and BERT [Devlin et al.2019]. These models can take an input text and provide contextual embeddings for each token for use in downstream architectures, or can perform task-specific fine-tuning. They have been shown to achieve state-of-the-art results for many different NLP tasks like document classification, question answering, dependency parsing, etc. More recent works [Beltagy, Cohan, and Lo2019, Lee et al.2019] have shown that contextual embedding models trained on domain- or genre-specific corpora can outperform general purpose models that are usually trained on Wikipedia text.
Contributions - Despite all these developments, to the best of our knowledge, no recent studies show the use of contextual embeddings for keyphrase extraction. We expect that, as with other NLP tasks, keyphrase extraction can benefit from the use of contextual embeddings. We also posit that genre-specific language models may further help improve performance. To explore these hypotheses, in this paper, we approach keyphrase extraction as a sequence labeling task solved using a BiLSTM-CRF, where the underlying words in the input sequence are represented using various contextual embedding architectures. The following are the main contributions of this paper:
- We quantify the benefits of using deep contextual embedding models (BERT, SciBERT, OpenAI GPT, ELMo, RoBERTa, Transformer XL and OpenAI GPT-2) for sequence-labeling-based keyphrase extraction from scientific text over using fixed word embedding models (word2vec, Glove and FastText).
- We demonstrate the benefits of using a BiLSTM-CRF architecture with contextualized word embeddings over fine-tuning the contextualized word embedding model to the keyphrase extraction task.
- We demonstrate improvements using contextualized word embeddings that are trained on a large corpus of in-genre text (SciBERT) over ones trained on generic text (BERT).
- We perform a robust set of experiments on three benchmark datasets (Inspec, SemEval-2010, SemEval-2017), and achieve state-of-the-art results. We compare the performance of these models with popular baseline unsupervised and supervised techniques.
- We perform a thorough error analysis and provide insights into how particular contextual embeddings work better than others and into the working of the different self-attention layers of our top-performing models.
- We process the existing benchmark datasets using a B-I-O tagging scheme, thus making them more suitable for building sequence-labeling-based models. We make the processed datasets, our source code, and our trained models publicly available (www.anonymous.com) for the benefit of the research community.
The rest of the paper is organized as follows. Section 2 presents prior work on keyphrase extraction and recent developments in deep contextualized language models. Section 3 details the BiLSTM-CRF architecture. Section 4 describes our experiments. Sections 5 and 6 present experimental results and an attention analysis.
2 Background and Related Work
2.1 Keyphrase extraction
The task of automated keyphrase extraction has attracted attention from researchers for nearly 20 years [Frank et al.1999]. Over this time, researchers have developed a wide array of both supervised and unsupervised techniques. In the supervised setting, keyphrase extraction is treated as a binary classification problem, with annotated keyphrases serving as positive examples and all other phrases as negative examples. Supervised techniques employ a machine learning model to determine if a given candidate phrase is a keyphrase based on textual features such as term frequencies [Hulth2003], syntactic properties [Kim and Kan2009], or location information [Nguyen and Kan2007]. More features can be included with the use of external resources like document citations [Caragea et al.2014] or hyperlinks [Kelleher and Luz2005]. Different classification algorithms have been used, including naive Bayes [Witten et al.2005, Turney2000], bagging [Hulth2003], boosting [Hulth et al.2001, Lopez and Romary2010], and SVMs [Zhao et al.2011].
Popular unsupervised methods such as TextRank [Mihalcea and Tarau2004], LexRank [Erkan and Radev2004], TopicRank [Bougouin, Boudin, and Daille2013], SGRank [Danesh, Sumner, and Martin2015], and SingleRank [Wan and Xiao2008] directly leverage the graph-based ranking algorithm PageRank [Page et al.1999], in combination with other heuristics based on tf-idf scores, word co-occurrence measures, extraction of specific lexical patterns, and clustering. Recently, several works [Mahata et al.2018a, Wang, Liu, and McDonald2015, Mahata et al.2018b] have shown the effectiveness of word embeddings in unsupervised keyphrase extraction.
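To make the graph-based idea concrete, here is a minimal sketch of PageRank run over a word co-occurrence graph. It is only in the spirit of the methods listed above and reproduces none of them exactly: the window size, damping factor, iteration count, and the absence of any part-of-speech filtering or phrase assembly are simplifying assumptions.

```python
from collections import defaultdict


def cooccurrence_graph(tokens, window=2):
    """Undirected graph linking words that occur within `window` positions."""
    graph = defaultdict(set)
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if tokens[j] != word:
                graph[word].add(tokens[j])
                graph[tokens[j]].add(word)
    return graph


def pagerank(graph, damping=0.85, iterations=50):
    """Plain power iteration for PageRank on an undirected, unweighted graph."""
    nodes = list(graph)
    scores = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        new_scores = {}
        for node in nodes:
            # Each neighbour shares its current score equally among its edges.
            incoming = sum(scores[nb] / len(graph[nb]) for nb in graph[node])
            new_scores[node] = (1 - damping) / len(nodes) + damping * incoming
        scores = new_scores
    return scores


tokens = "graph ranking graph methods graph clustering".split()
scores = pagerank(cooccurrence_graph(tokens, window=1))
for word in sorted(scores, key=scores.get, reverse=True):
    print(word, round(scores[word], 3))
```

Words that co-occur with many other words accumulate score, so the ranking needs no labeled data; the trade-off is that the graph construction encodes assumptions that may not suit every document type.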
In the presence of domain-specific data, supervised methods have shown better performance. Unsupervised methods have the advantage of not requiring any training data and can produce results in any domain. However, the assumptions of unsupervised methods do not hold for every type of document. On the other hand, casting keyphrase extraction as a binary classification problem has its disadvantages since each candidate phrase is labeled independent of all others.
Gollapalli and Li [Gollapalli and Li2016] were among the first to approach keyphrase extraction as a sequence labeling task. They used Conditional Random Fields (CRF) with many textual features such as tf-idf of the terms, orthographic information, POS tags, and positional information. Alzaidy et al. [Alzaidy, Caragea, and Giles2019] used BiLSTM-CRFs, where the words were represented using fixed word embeddings like Glove embeddings [Pennington, Socher, and Manning2014]. In this paper, we further explore sequence labeling techniques for keyphrase extraction using contextualized word embeddings.
2.2 Contextual Embeddings
ELMo [Peters et al.2018] uses stacked bidirectional LSTMs with a residual connection to model sentences, and the tokens are represented using character-level CNNs. BERT [Devlin et al.2019] and OpenAI GPT [Radford et al.2018] do away with the use of LSTMs and instead employ multi-layer Transformers [Vaswani et al.2017] for sentence modeling. The main difference between BERT and OpenAI GPT is that the former uses a bidirectional Transformer while the latter uses a left-to-right Transformer. The OpenAI GPT-2 model [Radford et al.2019] was proposed as a direct successor of the GPT model, and is trained on 10x more data than the original GPT, with better performance and the additional benefit of working in a zero-shot transfer setting. RoBERTa [Liu et al.2019] was recently published as a replication study on BERT, in which a new, larger dataset was used for training, with more iterations and without the next sentence prediction training objective of the original model. This helped RoBERTa achieve state-of-the-art performance on different benchmark datasets, making it equivalent in performance to other contextual language models like Transformer XL [Dai et al.2019] and XLNet [Yang et al.2019]. More recent works, like SciBERT [Beltagy, Cohan, and Lo2019], have used the BERT architecture to build domain-specific language models. SciBERT was trained on a corpus of 1.14 million scientific papers, mostly from the computer science and biomedical domains. It has achieved state-of-the-art results on many scientific NLP tasks including NER, document classification, and dependency parsing.
In this work, we combine the benefits of formulating keyphrase extraction as a sequence labeling task with the rich representation of language by contextual embeddings. To our knowledge, this is the first work which attempts a robust evaluation of the performance of a BiLSTM-CRF architecture using contextual embeddings for keyphrase extraction. We also present a thorough comparison of the performance of different word embedding models.
We approach the problem of automated keyphrase extraction from scholarly articles as a sequence labeling task, which can be formally stated as: Let $d = \{w_1, w_2, \ldots, w_n\}$ be an input text, where $w_t$ represents the $t^{th}$ token. Assign each $w_t$ in the document one of three class labels $Y = \{k_B, k_I, k_O\}$, where $k_B$ denotes that $w_t$ marks the beginning of a keyphrase, $k_I$ means that $w_t$ is inside a keyphrase, and $k_O$ indicates that $w_t$ is not part of a keyphrase.
In this paper, we employ a BiLSTM-CRF architecture to solve this sequence labeling problem. LSTMs [Gers, Schmidhuber, and Cummins1999] are recurrent neural networks that deal with vanishing and exploding gradient problems through the use of gated architectures. Bidirectional LSTMs (BiLSTMs) are a generalization of LSTMs that capture long-distance dependencies between words in both directions.
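We first map each token $w_t$ in the input text $d$ to a fixed-size dense vector $x_t$; thus $d$ is represented as a sequence of vectors $x = \{x_1, x_2, \ldots, x_n\}$. The corresponding class labels are $y = \{y_1, y_2, \ldots, y_n\}$, where $y_t \in Y$. We then use a BiLSTM to encode sequential relations between the word representations. An LSTM unit consists of four main components: input gate ($i_t$), forget gate ($f_t$), memory cell ($c_t$), and output gate ($o_t$), which are defined as below:
\begin{align}
i_t &= \sigma(W_i [h_{t-1}; x_t] + b_i) \tag{1} \\
f_t &= \sigma(W_f [h_{t-1}; x_t] + b_f) \tag{2} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_c [h_{t-1}; x_t] + b_c) \tag{3} \\
o_t &= \sigma(W_o [h_{t-1}; x_t] + b_o) \tag{4} \\
h_t &= o_t \odot \tanh(c_t) \tag{5}
\end{align}
In the above equations, $\sigma$ denotes the sigmoid function, $\tanh$ the hyperbolic tangent function, and $\odot$ an element-wise product. $W$ and $b$ are model parameters that are estimated during training, and $h_t$ is the hidden state.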
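In a bidirectional LSTM, we apply equations 1 to 5 in both directions to create two hidden state vectors $\overrightarrow{h_t}$ and $\overleftarrow{h_t}$, where $\overrightarrow{h_t}$ provides a representation for word $w_t$ by incorporating information from the preceding words $w_1, \ldots, w_{t-1}$, and $\overleftarrow{h_t}$ builds a representation for word $w_t$ by capturing information from the succeeding words $w_{t+1}, \ldots, w_n$. By concatenating $\overrightarrow{h_t}$ and $\overleftarrow{h_t}$, we get a vector representation $h_t$ of $w_t$ in the context of the input text $d$.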
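The recurrence and its bidirectional use can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the implementation used in the experiments: it assumes the common formulation in which every gate applies one weight matrix to the concatenation of the previous hidden state and the current input, and the helper names (`lstm_step`, `run_lstm`, `init_params`, `bilstm`), the random initialization, and the toy dimensions are all hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step; every gate sees the concatenation [h_prev; x_t]."""
    z = np.concatenate([h_prev, x_t])
    i_t = sigmoid(params["W_i"] @ z + params["b_i"])      # input gate
    f_t = sigmoid(params["W_f"] @ z + params["b_f"])      # forget gate
    o_t = sigmoid(params["W_o"] @ z + params["b_o"])      # output gate
    c_tilde = np.tanh(params["W_c"] @ z + params["b_c"])  # candidate memory
    c_t = f_t * c_prev + i_t * c_tilde                    # memory cell
    h_t = o_t * np.tanh(c_t)                              # hidden state
    return h_t, c_t


def run_lstm(xs, params, hidden_size):
    """Run the recurrence over a sequence, starting from zero states."""
    h = np.zeros(hidden_size)
    c = np.zeros(hidden_size)
    states = []
    for x_t in xs:
        h, c = lstm_step(x_t, h, c, params)
        states.append(h)
    return states


def init_params(input_size, hidden_size, rng):
    """Random toy parameters (hypothetical initialization, for illustration)."""
    params = {}
    for gate in "ifoc":
        params[f"W_{gate}"] = rng.normal(
            scale=0.1, size=(hidden_size, hidden_size + input_size))
        params[f"b_{gate}"] = np.zeros(hidden_size)
    return params


def bilstm(xs, fwd_params, bwd_params, hidden_size):
    """Concatenate left-to-right and right-to-left hidden states per token."""
    forward = run_lstm(xs, fwd_params, hidden_size)
    backward = run_lstm(xs[::-1], bwd_params, hidden_size)[::-1]
    return [np.concatenate([f, b]) for f, b in zip(forward, backward)]


rng = np.random.default_rng(0)
embeddings = rng.normal(size=(5, 3))  # 5 tokens, 3-dimensional toy embeddings
states = bilstm(embeddings, init_params(3, 4, rng), init_params(3, 4, rng), 4)
print(len(states), states[0].shape)
```

The backward pass is simply the same recurrence run over the reversed sequence with its own parameters, so the second half of each concatenated vector carries information from the tokens to the right.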
We then apply an affine transformation to map the output from the BiLSTM to the class space:
$$e_t = W_a h_t + b_a$$
where $W_a$ is a matrix of size $|Y| \times |h_t|$ and $b_a$ is a bias vector of size $|Y|$.
The score outputs from the BiLSTM serve as input to a CRF layer. CRFs [Lafferty, McCallum, and Pereira2001] are discriminative probabilistic models that have been used in many sequence tagging problems in NLP. CRFs, when used in conjunction with deep learning models [Huang, Xu, and Yu2015], have been shown to improve performance on many sequence labeling tasks.
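In a CRF, the score of an output label sequence $y$ is:
$$s(d, y) = \sum_{t=1}^{n} e_t[y_t] + \sum_{t=2}^{n} \tau_{y_{t-1}, y_t}$$
Here $e_t[y_t]$ is the score that the affine layer assigns to label $y_t$ at position $t$, and $\tau$ is a transition matrix where $\tau_{i,j}$ represents the transition score from class $i$ to class $j$. The likelihood for a labeling sequence is generated by exponentiating the scores and normalizing over all possible output label sequences.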
During inference, CRFs use the Viterbi algorithm to efficiently find the optimal sequence of labels. The entire architecture is summarized in Figure 1. As a baseline, we also experiment with a plain BiLSTM architecture.
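A small, self-contained sketch of CRF scoring and Viterbi decoding is given below. It is illustrative only: the labels are encoded as the indices 0 (B), 1 (I), and 2 (O), the per-token scores and transition scores are made-up toy numbers, the `start` score vector is an added assumption used here to discourage sequences that begin with I, and the likelihood is normalized by brute-force enumeration, which is feasible only for very short sequences.

```python
import itertools
import math


def sequence_score(emissions, transitions, start, labels):
    """Sum of start, per-token, and transition scores for one label sequence."""
    score = start[labels[0]] + emissions[0][labels[0]]
    for t in range(1, len(labels)):
        score += transitions[labels[t - 1]][labels[t]] + emissions[t][labels[t]]
    return score


def viterbi(emissions, transitions, start):
    """Return the highest-scoring label sequence and its score."""
    n_labels = len(start)
    best = [start[j] + emissions[0][j] for j in range(n_labels)]
    backpointers = []
    for t in range(1, len(emissions)):
        new_best, pointers = [], []
        for j in range(n_labels):
            prev = max(range(n_labels), key=lambda i: best[i] + transitions[i][j])
            pointers.append(prev)
            new_best.append(best[prev] + transitions[prev][j] + emissions[t][j])
        best = new_best
        backpointers.append(pointers)
    last = max(range(n_labels), key=lambda j: best[j])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]


def log_likelihood(emissions, transitions, start, labels):
    """log p(labels | input), normalizing over all sequences by brute force."""
    n_labels = len(start)
    all_scores = [
        sequence_score(emissions, transitions, start, list(seq))
        for seq in itertools.product(range(n_labels), repeat=len(emissions))
    ]
    top = max(all_scores)
    log_z = top + math.log(sum(math.exp(s - top) for s in all_scores))
    return sequence_score(emissions, transitions, start, labels) - log_z


# Toy scores for three tokens; columns are B, I, O.
emissions = [[2.0, 0.0, 1.0], [0.5, 1.5, 1.0], [0.0, 0.5, 2.0]]
# transitions[i][j]: score of moving from label i to label j; O -> I is penalized.
transitions = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0], [0.0, -100.0, 0.0]]
start = [0.0, -100.0, 0.0]

path, score = viterbi(emissions, transitions, start)
print(path, score)
print(math.exp(log_likelihood(emissions, transitions, start, path)))
```

Viterbi keeps only the best-scoring prefix for each label at each position, so decoding is linear in the sequence length instead of enumerating every label sequence as the brute-force normalizer does.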
Datasets - We ran our experiments on three different publicly available keyphrase extraction datasets: Inspec [Hulth2003], SemEval-2010 [Kim et al.2010] (hereafter, SE-2010), and SemEval-2017 [Augenstein et al.2017] (hereafter, SE-2017). Inspec consists of abstracts from 2000 scientific articles divided into train, validation and test sets containing 1000, 500, and 500 abstracts respectively. Each abstract is accompanied by two sets of human-annotated keyphrases: controlled - as assigned by the authors, and uncontrolled - assigned by the readers. Controlled keyphrases are mostly abstractive, i.e., not present in the abstracts, whereas the uncontrolled keyphrases are mostly extractive. SE-2010 consists of 284 full length ACM articles divided into train, trial and test splits containing 144, 40, and 100 articles respectively. SE-2010 also has author-assigned (controlled) and reader-assigned (uncontrolled) keyphrases. SE-2017 consists of 500 open access articles published in ScienceDirect divided into train, dev and test sets containing 350, 50, and 100 articles respectively. Unlike the other two datasets, SE-2017 provides location spans for all keyphrases, i.e. all keyphrases are extractive.
Because we are modeling keyphrase extraction as a sequence labeling task, we only consider extractive keyphrases that are present in the article abstracts in each data set. For Inspec and SE-2010, we automatically identified the location spans for each extractive keyphrase. We discarded the full text articles from SE-2010 and SE-2017 due to memory constraints during training and inference with the contextual embedding models on large documents. We also discarded the trial dataset from SE-2010 and instead randomly chose 14 documents from the train split for validation.
All the tokens in each dataset were tagged using the B-I-O tagging scheme described in the problem statement in the previous section. We plan to release this processed dataset along with this publication. Table 1 provides some general statistics on the processed dataset used in this paper.
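The tagging step can be sketched as follows. This is a hypothetical illustration and not the released preprocessing code: it assumes whitespace-tokenized input, case-insensitive exact matching, and a longest-phrase-first policy for overlapping matches, and the example sentence and keyphrases are invented.

```python
def bio_tag(tokens, keyphrases):
    """Tag each token as B, I, or O by exact (case-insensitive) phrase matching."""
    lowered = [tok.lower() for tok in tokens]
    tags = ["O"] * len(tokens)
    # Match longer phrases first so that shorter ones cannot split them.
    for phrase in sorted(keyphrases, key=lambda p: -len(p.split())):
        words = phrase.lower().split()
        n = len(words)
        if n == 0:
            continue
        for i in range(len(tokens) - n + 1):
            span_free = all(tag == "O" for tag in tags[i:i + n])
            if span_free and lowered[i:i + n] == words:
                tags[i] = "B"
                tags[i + 1:i + n] = ["I"] * (n - 1)
    return tags


# Invented example; the keyphrases are not taken from any of the datasets.
tokens = "We study keyphrase extraction with neural networks".split()
keyphrases = ["keyphrase extraction", "neural networks"]
for token, tag in zip(tokens, bio_tag(tokens, keyphrases)):
    print(token, tag)
```

Keyphrases that never occur verbatim in the text leave every token tagged O, which is why only extractive keyphrases can be represented in this scheme.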
Experimental Settings - One of the main aims of this work is to study the effectiveness of contextual embeddings in keyphrase extraction. To this end, we use the BiLSTM-CRF and BiLSTM architectures with seven different pre-trained contextual embeddings: BERT (small-cased, small-uncased, large-cased, large-uncased), SciBERT (basevocab-cased, basevocab-uncased, scivocab-cased, scivocab-uncased), OpenAI GPT, ELMo, RoBERTa (base, large), Transformer XL, and OpenAI GPT-2 (small, medium). As a baseline, we also use 300-dimensional fixed embeddings from Glove (https://nlp.stanford.edu/projects/glove/), Word2Vec (https://github.com/mmihaltz/word2vec-GoogleNews-vectors), and FastText (https://fasttext.cc/docs/en/english-vectors.html; common-crawl, wiki-news). We also compare the proposed architecture against four popular baseline keyphrase extraction techniques: SGRank, SingleRank, TextRank, and KEA. Of these, the first three are unsupervised while KEA is a supervised technique.
We trained BiLSTM-CRF and BiLSTM models using Stochastic Gradient Descent (SGD) with Nesterov momentum in batched mode. Due to system memory constraints, the batch size was set to 4. The learning rate was set to 0.05 and the models were trained for a total of 100 epochs with a patience value of 4 and an annealing factor of 0.5; i.e., if the model performance did not improve for 4 epochs, the learning rate was multiplied by 0.5. The hidden layers in the BiLSTM models were set to 128 units and word dropout was set to 0.05. The token representations obtained using the different embeddings are not tuned during the training process. During inference, we run the model on a given abstract and identify keyphrases as all sequences of class labels that begin with the tag $k_B$ (beginning of a keyphrase) followed by zero or more tokens tagged $k_I$ (inside a keyphrase). Following previous studies [Kim et al.2010], we use Precision, Recall, and F1-measure based on actual matches against the ground truth to evaluate the different approaches, and use the F1-measure to compare different models (due to space constraints, we only report the F-measure in the paper).
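The decoding rule and the evaluation measure can be sketched as below. This is a minimal sketch under stated assumptions, not the evaluation script used for the reported numbers: tags are written as the strings B, I, and O, an I tag with no preceding B is ignored, and matching is assumed to be exact, case-insensitive, and computed over sets of unique phrases.

```python
def extract_keyphrases(tokens, tags):
    """Collect every span that starts with B and continues with zero or more I."""
    phrases, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            if current:
                phrases.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:
            current.append(token)
        else:
            if current:
                phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases


def exact_match_prf(predicted, gold):
    """Precision, recall and F1 over sets of lower-cased phrases."""
    pred_set = {p.lower() for p in predicted}
    gold_set = {g.lower() for g in gold}
    true_pos = len(pred_set & gold_set)
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Invented example: the second predicted span is cut short by one token.
tokens = "We study keyphrase extraction with neural networks".split()
tags = ["O", "O", "B", "I", "O", "B", "O"]
predicted = extract_keyphrases(tokens, tags)
print(predicted)
print(exact_match_prf(predicted, ["keyphrase extraction", "neural networks"]))
```

Under exact matching, a span that is off by a single token counts as both a false positive and a false negative, which is one reason F1 scores on this task tend to look low.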
In this section, we report the main observations for the different experiments that we performed. For each embedding model we report results for the best performing variant of that model (e.g. cased vs uncased) on each dataset. Overall, BERT and SciBERT models are the best-performing embedding models, and the BiLSTM-CRF architecture is the best architecture, consistently across all the datasets.
CRF layer - Table 2 presents a comparison of the BiLSTM and BiLSTM-CRF architectures in terms of F1-scores for three models: SciBERT, BERT, and ELMo. The addition of the CRF layer improved the performance for all datasets. However, it was most effective for ELMo. For example, on the SE-2010 data with ELMo, the CRF layer increased the F1-score by about 43%, from 0.157 to 0.225. An analysis of results on the SemEval-2017 data shows that the CRF layer is more effective in capturing keyphrases that include prepositions (e.g. ‘of’), conjunctions (e.g. ‘and’), and articles (e.g. ‘the’). We also observed that the CRF layer is more accurate with longer keyphrases (more than two tokens).
Fine-tuning - Contextualized embedding models can be used in two ways: (1) they can serve as numerical representations of words that are to be used in downstream architectures, or (2) they can be fine-tuned to be optimized for a specific task. Fine-tuning typically involves adding an untrained layer at the end and then optimizing the layer weights for the task-specific objective. We fine-tuned our best-performing contextualized embedding models (BERT and SciBERT) for each dataset and compared with the performance of the corresponding BiLSTM-CRF models when used with the same pre-trained embeddings. The results are summarized in Table 3. The BiLSTM-CRF outperforms contextual embedding model fine-tuning across all datasets, for both BERT and SciBERT. We think this might be due to the small sizes of the datasets on which the models are fine-tuned.
5.2 Contextual embeddings
Here we want to understand which of the various contextual embedding models discussed in Section 2 is best suited for this task. These models vary in architecture, training data and hyperparameter choices. Table 4 presents the performance of the best variant (in our experiments) of the seven contextual embedding models and three fixed embedding models, using the BiLSTM-CRF architecture.
Of the ten embedding architectures, BERT or BERT-based models consistently obtained the best performance across all datasets. This was expected considering that BERT uses bidirectional pre-training, which is more powerful. SciBERT was consistently one of the top-performing models and was significantly better than any of the other models on SemEval-2010. Further analysis of the results on SemEval-2010 shows that SciBERT was more accurate than other models in capturing keyphrases that contained scientific terms such as chemical names (e.g. ‘Magnesium’, ‘Hydrozincite’), software projects (e.g. ‘HemeLB’), and abbreviations (e.g. ‘DSP’, ‘SIMLIB’). SciBERT was also more accurate with keyphrases containing more than three tokens. The difference in training vocabulary is apparent in cases where SciBERT is able to classify scientific nouns (e.g. ‘real time’, ‘chip’, ‘transform’).
Contextual embeddings outperformed their fixed counterparts for most of the experimental scenarios. The only exception was on SemEval-2010 where FastText outperformed Transformer-XL. Of the three fixed embedding models studied in this paper, FastText obtained the best performance across all datasets.
We also compare the training loss (Figure 2) and validation loss (Figure 3) of BiLSTM-CRF models on four embeddings: SciBERT, BERT, FastText, and Word2Vec. These figures show that the loss values decrease faster for contextual models than for fixed embeddings. Also, SciBERT’s validation loss converged faster than BERT’s. This is consistent with some of the findings in the transfer learning literature [Goldberg2017], which has shown that these pre-trained contextual models with rich linguistic information can easily adapt to new domains.
5.3 Baseline comparisons
Lastly, we compare our best-performing model (BiLSTM-CRF with SciBERT embeddings) against four baseline methods: SGRank, SingleRank, TextRank, and KEA. Table 5 presents the results. As expected, our model significantly outperforms all the baseline methods on all three datasets, and this holds irrespective of the embeddings used. For example, on the Inspec data, the worst-performing BiLSTM-CRF model (Glove at 0.457) is still significantly better than the best-performing baseline model (SGRank at 0.271). Of the four baseline methods evaluated here, SGRank achieved the best results.
For SemEval-2017, the best reported F1 score is 0.55 [Ammar et al.2017]. Ammar et al.’s work makes use of keyphrase type information in its modeling (whether a keyphrase is a task, material, or process) and also employs some task-specific gazetteers; it is therefore not directly comparable to our results. Likewise, to our knowledge, the best reported F1 score for SemEval-2010 is 0.29 [Mahata et al.2018a], but those models use the entire articles. Though these numbers are not directly comparable to our work, we report them here to provide complete context. For Inspec, Mahata et al. [Mahata et al.2018a] reported an F1 score of 0.52, which our best model significantly outperforms at 0.59.
6 Case study: Attention Analysis
Attention analysis is used to understand whether neural network-internal attention mechanisms provide any insight into the linguistic properties learned by the models. We present a case study of attention analysis for keyphrase extraction on a randomly chosen abstract from SemEval-2017.
Table 6: Abstract used in the case study, as classified by BERT and SciBERT (both columns contain the same text; the color highlighting is not preserved here).
|An object-oriented version of SIMLIB -LRB- a simple simulation package -RRB- This paper introduces an object-oriented version of SIMLIB -LRB- an easy-to-understand discrete-event simulation package -RRB- . The object-oriented version is preferable to the original procedural language versions of SIMLIB in that it is easier to understand and teach simulation from an object point of view . A single-server queue simulation is demonstrated using the object-oriented SIMLIB|
Table 6 presents the classification results on this abstract from the BERT and SciBERT models; true positives are marked in green and false negatives in red. Using BertViz [Vig2019], we analyzed the aggregated attention of all 12 layers of both models. We observed that keyphrase tokens (those tagged $k_B$ and $k_I$) typically pay the most attention to other keyphrase tokens. In contrast, non-keyphrase tokens (those tagged $k_O$) usually pay uniform attention to their surrounding tokens. We found that both BERT and SciBERT exhibit similar attention patterns in the initial and final layers, but they vary significantly in the middle layers. For example, Figure 5 compares the attention patterns in the fifth layer of both models. In SciBERT, the token ‘object’ is very strongly linked to other tokens from its keyphrase, but the attention weights are comparatively weaker for BERT.
We also observed that keyphrase tokens paid strong attention to similar tokens from other keyphrases. For example, as shown in Figure 5, the token ‘version’ from ‘object-oriented version’ pays strong attention to ‘versions’ from ‘procedural language versions’. This is a possible reason for both models failing to identify the third mention of ‘object-oriented version’ in the abstract as a keyphrase. We observed similar patterns in many other documents through our attention analysis. In future work, we plan to quantify this analysis over multiple documents.
In this paper, we formulate keyphrase extraction as a sequence labeling task solved using BiLSTM-CRFs, where the underlying words are represented using various contextualized embedding models. Through our experimental work, we quantify the benefits of this architecture over direct fine tuning of the embedding models. We also demonstrate how contextual embeddings significantly outperform their fixed counterparts in keyphrase extraction, with BERT based models performing the best. We also performed attention analysis on a sample scientific abstract to build an intuitive understanding of the working of the self-attention layers of BERT and SciBERT.
Our approach only deals with the problem of keyphrase extraction but not generation. In the future, we plan to use some of the findings from this paper to help build keyphrase generation models. It would also be beneficial to look at the working of self-attention layers with greater detail and study the fine-tuning capabilities of the contextual embedding models on bigger datasets for the task of keyphrase extraction. We also expect some of the findings in this paper could prove useful to other NLP problems like document summarization.
- [Alzaidy, Caragea, and Giles2019] Alzaidy, R.; Caragea, C.; and Giles, C. L. 2019. Bi-LSTM-CRF sequence labeling for keyphrase extraction from scholarly documents. In Proc. WWW.
- [Ammar et al.2017] Ammar, W.; Peters, M.; Bhagavatula, C.; and Power, R. 2017. The ai2 system at semeval-2017 task 10 (scienceie): semi-supervised end-to-end entity and relation extraction. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), 592–596.
- [Augenstein et al.2017] Augenstein, I.; Das, M.; Riedel, S.; Vikraman, L.; and McCallum, A. 2017. SemEval 2017 task 10: ScienceIE-extracting keyphrases and relations from scientific publications. In Proc. SemEval.
- [Barker and Cornacchia2000] Barker, K., and Cornacchia, N. 2000. Using noun phrase heads to extract document keyphrases. In Proc. Canadian Conference on AI.
- [Beltagy, Cohan, and Lo2019] Beltagy, I.; Cohan, A.; and Lo, K. 2019. Scibert: Pretrained contextualized embeddings for scientific text. arXiv:1903.10676.
- [Bougouin, Boudin, and Daille2013] Bougouin, A.; Boudin, F.; and Daille, B. 2013. Topicrank: Graph-based topic ranking for keyphrase extraction. In Proc. IJCNLP.
- [Caragea et al.2014] Caragea, C.; Bulgarov, F. A.; Godea, A.; and Gollapalli, S. D. 2014. Citation-enhanced keyphrase extraction from research papers: A supervised approach. In Proc. EMNLP.
- [Dai et al.2019] Dai, Z.; Yang, Z.; Yang, Y.; Cohen, W. W.; Carbonell, J.; Le, Q. V.; and Salakhutdinov, R. 2019. Transformer-XL: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860.
- [Danesh, Sumner, and Martin2015] Danesh, S.; Sumner, T.; and Martin, J. H. 2015. SGRank: Combining statistical and graphical methods to improve the state of the art in unsupervised keyphrase extraction. In Proc. Joint Conference on Lexical and Computational Semantics.
- [Devlin et al.2019] Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proc. NAACL-HLT.
- [Erkan and Radev2004] Erkan, G., and Radev, D. R. 2004. LexRank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research 22:457–479.
- [Frank et al.1999] Frank, E.; Paynter, G. W.; Witten, I. H.; Gutwin, C.; and Nevill-Manning, C. G. 1999. Domain-specific keyphrase extraction. In Proc. IJCAI.
- [Gers, Schmidhuber, and Cummins1999] Gers, F. A.; Schmidhuber, J.; and Cummins, F. 1999. Learning to forget: Continual prediction with LSTM. In Proc. ICANN.
- [Goldberg and Levy2014] Goldberg, Y., and Levy, O. 2014. word2vec explained: deriving Mikolov et al.’s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722.
- [Goldberg2017] Goldberg, Y. 2017. Neural network methods for natural language processing. Synthesis Lectures on Human Language Technologies 10(1):1–309.
- [Gollapalli and Li2016] Gollapalli, S. D., and Li, X.-l. 2016. Keyphrase extraction using sequential labeling. arXiv preprint arXiv:1608.00329.
- [Grineva, Grinev, and Lizorkin2009] Grineva, M.; Grinev, M.; and Lizorkin, D. 2009. Extracting key terms from noisy and multitheme documents. In Proc. WWW.
- [Hasan and Ng2014] Hasan, K. S., and Ng, V. 2014. Automatic keyphrase extraction: A survey of the state of the art. In Proc. ACL.
- [Huang, Xu, and Yu2015] Huang, Z.; Xu, W.; and Yu, K. 2015. Bidirectional LSTM-CRF models for sequence tagging. CoRR abs/1508.01991.
- [Hulth and Megyesi2006] Hulth, A., and Megyesi, B. B. 2006. A study on automatically extracted keywords in text categorization. In Proc. COLING-ACL.
- [Hulth et al.2001] Hulth, A.; Karlgren, J.; Jonsson, A.; Boström, H.; and Asker, L. 2001. Automatic keyword extraction using domain knowledge. In Proc. International Conference on Intelligent Text Processing and Computational Linguistics.
- [Hulth2003] Hulth, A. 2003. Improved automatic keyword extraction given more linguistic knowledge. In Proc. EMNLP.
- [Joulin et al.2017] Joulin, A.; Grave, E.; Bojanowski, P.; and Mikolov, T. 2017. Bag of tricks for efficient text classification. In Proc. EACL.
- [Kelleher and Luz2005] Kelleher, D., and Luz, S. 2005. Automatic hypertext keyphrase detection. In Proc. IJCAI.
- [Kim and Kan2009] Kim, S. N., and Kan, M.-Y. 2009. Re-examining automatic keyphrase extraction approaches in scientific articles. In Proc. ACL 2009 Workshop on Multiword Expressions: Identification, Interpretation, Disambiguation and Applications.
- [Kim et al.2010] Kim, S. N.; Medelyan, O.; Kan, M.-Y.; and Baldwin, T. 2010. SemEval-2010 task 5: Automatic keyphrase extraction from scientific articles. In Proc. SemEval.
- [Lafferty, McCallum, and Pereira2001] Lafferty, J.; McCallum, A.; and Pereira, F. C. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. ICML.
- [Lee et al.2019] Lee, J.; Yoon, W.; Kim, S.; Kim, D.; Kim, S.; So, C. H.; and Kang, J. 2019. BioBERT: pre-trained biomedical language representation model for biomedical text mining. arXiv preprint arXiv:1901.08746.
- [Liu et al.2009] Liu, F.; Pennell, D.; Liu, F.; and Liu, Y. 2009. Unsupervised approaches for automatic keyword extraction using meeting transcripts. In Proc. NAACL-HLT.
- [Liu et al.2019] Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; and Stoyanov, V. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
- [Lopez and Romary2010] Lopez, P., and Romary, L. 2010. HUMB: Automatic key term extraction from scientific articles in GROBID. In Proc. SemEval.
- [Mahata et al.2018a] Mahata, D.; Kuriakose, J.; Shah, R. R.; and Zimmermann, R. 2018a. Key2Vec: Automatic ranked keyphrase extraction from scientific articles using phrase embeddings. In Proc. NAACL-HLT.
- [Mahata et al.2018b] Mahata, D.; Shah, R. R.; Kuriakose, J.; Zimmermann, R.; and Talburt, J. R. 2018b. Theme-weighted ranking of keywords from text documents using phrase embeddings. In Proc. IEEE Conference on Multimedia Information Processing and Retrieval (MIPR).
- [Mihalcea and Tarau2004] Mihalcea, R., and Tarau, P. 2004. TextRank: Bringing order into text. In Proc. EMNLP.
- [Nguyen and Kan2007] Nguyen, T. D., and Kan, M.-Y. 2007. Keyphrase extraction in scientific publications. In Goh, D. H.-L.; Cao, T. H.; Sølvberg, I. T.; and Rasmussen, E., eds., Asian Digital Libraries. Looking Back 10 Years and Forging New Frontiers, 317–326. Berlin, Heidelberg: Springer Berlin Heidelberg.
- [Page et al.1999] Page, L.; Brin, S.; Motwani, R.; and Winograd, T. 1999. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab.
- [Pennington, Socher, and Manning2014] Pennington, J.; Socher, R.; and Manning, C. 2014. GloVe: Global vectors for word representation. In Proc. EMNLP.
- [Peters et al.2018] Peters, M. E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365.
- [Radford et al.2018] Radford, A.; Narasimhan, K.; Salimans, T.; and Sutskever, I. 2018. Improving language understanding by generative pre-training. https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf.
- [Radford et al.2019] Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; and Sutskever, I. 2019. Language models are unsupervised multitask learners. OpenAI Blog.
- [Turney2000] Turney, P. D. 2000. Learning algorithms for keyphrase extraction. Information Retrieval 2(4):303–336.
- [Turney2002] Turney, P. D. 2002. Learning to extract keyphrases from text. arXiv preprint cs/0212013.
- [Vaswani et al.2017] Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, Ł.; and Polosukhin, I. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008.
- [Vig2019] Vig, J. 2019. A multiscale visualization of attention in the transformer model. arXiv preprint arXiv:1906.05714.
- [Wan and Xiao2008] Wan, X., and Xiao, J. 2008. Single document keyphrase extraction using neighborhood knowledge. In Proc. AAAI.
- [Wang, Liu, and McDonald2015] Wang, R.; Liu, W.; and McDonald, C. 2015. Using word embeddings to enhance keyword identification for scientific publications. In Proc. Australasian Database Conference.
- [Witten et al.2005] Witten, I. H.; Paynter, G. W.; Frank, E.; Gutwin, C.; and Nevill-Manning, C. G. 2005. Kea: Practical automated keyphrase extraction. In Design and Usability of Digital Libraries: Case Studies in the Asia Pacific, 129–152. IGI Global.
- [Yang et al.2019] Yang, Z.; Dai, Z.; Yang, Y.; Carbonell, J.; Salakhutdinov, R.; and Le, Q. V. 2019. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.
- [Yih, Goodman, and Carvalho2006] Yih, W.-t.; Goodman, J.; and Carvalho, V. R. 2006. Finding advertising keywords on web pages. In Proc. WWW.
- [Zhao et al.2011] Zhao, W. X.; Jiang, J.; He, J.; Song, Y.; Achananuparp, P.; Lim, E.-P.; and Li, X. 2011. Topical keyphrase extraction from Twitter. In Proc. ACL.