Keyphrase Extraction from Scholarly Articles as Sequence Labeling using Contextualized Embeddings

10/19/2019
by Dhruva Sahrawat, et al.

In this paper, we formulate keyphrase extraction from scholarly articles as a sequence labeling task solved using a BiLSTM-CRF, where the words in the input text are represented using deep contextualized embeddings. We evaluate the proposed architecture using both contextualized and fixed word embedding models on three different benchmark datasets (Inspec, SemEval 2010, SemEval 2017) and compare with existing popular unsupervised and supervised techniques. Our results quantify the benefits of (a) using contextualized embeddings (e.g. BERT) over fixed word embeddings (e.g. Glove); (b) using a BiLSTM-CRF architecture with contextualized word embeddings over fine-tuning the contextualized word embedding model directly, and (c) using genre-specific contextualized embeddings (SciBERT). Through error analysis, we also provide some insights into why particular models work better than others. Lastly, we present a case study where we analyze different self-attention layers of the two best models (BERT and SciBERT) to better understand the predictions made by each for the task of keyphrase extraction.


1 Introduction

Keyphrase extraction is the process of selecting phrases that capture the most salient topics in a document [Turney2002]. Keyphrases serve as an important piece of document metadata, often used in downstream tasks including information retrieval, document categorization, clustering and summarization. Keyphrase extraction has been applied to many different types of documents such as scientific papers [Kim et al.2010], news articles [Hulth and Megyesi2006], Web pages [Yih, Goodman, and Carvalho2006], and meeting transcripts [Liu et al.2009].

Classic techniques for keyphrase extraction involve a two stage approach [Hasan and Ng2014]: (1) candidate generation, and (2) pruning. During the first stage, the document is processed to extract a set of candidate keyphrases. In the second stage, this candidate set is pruned by selecting the most salient candidate keyphrases, using either supervised or unsupervised techniques. In the supervised setting, pruning is formulated as a binary classification problem: determine if a given candidate is a keyphrase. In the unsupervised setting, pruning is treated as a ranking problem, where the candidates are ranked based on some measure of importance and those below a particular threshold are discarded.

Challenges - Researchers typically employ a combination of different techniques for generating candidate keyphrases, such as extracting named entities, finding noun phrases that adhere to pre-defined lexical patterns [Barker and Cornacchia2000], or extracting n-grams that appear in an external knowledge base like Wikipedia [Grineva, Grinev, and Lizorkin2009]. The candidates are further cleaned up using stop word lists or gazetteers. Errors in any of these techniques reduce the quality of candidate keyphrases. For example, if a named entity is not identified as such, it misses out on being considered as a keyphrase; if there are errors in part-of-speech tagging, extracted noun phrases might be incomplete. Also, since candidate generation involves a combination of heuristics with specific parameters, thresholds, and external resources, it is hard to reproduce any particular result or migrate implementations to new domains.

Motivation - Recently, researchers have started to approach keyphrase extraction as a sequence labeling task, where each token in the document is tagged as either being part of a keyphrase or not [Gollapalli and Li2016, Alzaidy, Caragea, and Giles2019]. This formulation has many advantages. First, it completely bypasses the candidate generation stage and provides a unified approach to keyphrase extraction. Second, unlike binary classification, where each keyphrase is classified independently, sequence labeling finds an optimal assignment of keyphrase labels for the entire document. Lastly, sequence labeling makes it possible to capture long-term semantic dependencies in the document, which are known to be prevalent in natural language.

More recently, there have been significant advances in deep contextual language models such as ELMo [Peters et al.2018], and BERT [Devlin et al.2019]. These models can take an input text and provide contextual embeddings for each token for use in downstream architectures, or can perform task-specific fine-tuning. They have been shown to achieve state-of-the-art results for many different NLP tasks like document classification, question answering, dependency parsing, etc. More recent works [Beltagy, Cohan, and Lo2019, Lee et al.2019] have shown that contextual embedding models trained on domain- or genre-specific corpora can outperform general purpose models that are usually trained on Wikipedia text.

Contributions - Despite all these developments, to the best of our knowledge, no recent studies show the use of contextual embeddings for keyphrase extraction. We expect that, as with other NLP tasks, keyphrase extraction can benefit from the use of contextual embeddings. We also posit that genre-specific language models may further help improve performance. To explore these hypotheses, in this paper, we approach keyphrase extraction as a sequence labeling task solved using a BiLSTM-CRF, where the underlying words in the input sequence are represented using various contextual embedding architectures. The following are the main contributions of this paper:

We quantify the benefits of using deep contextual embedding models (BERT, SciBERT, OpenAI GPT, ELMo, RoBERTa, Transformer XL and OpenAI GPT-2) for sequence-labeling-based keyphrase extraction from scientific text over using fixed word embedding models (word2vec, Glove and FastText).

We demonstrate the benefits of using a BiLSTM-CRF architecture with contextualized word embeddings over fine-tuning the contextualized word embedding model to the keyphrase extraction task.

We demonstrate improvements using contextualized word embeddings that are trained on a large corpus of in-genre text (SciBERT) over ones trained on generic text (BERT).

We perform a robust set of experiments on three benchmark datasets (Inspec, SemEval-2010, SemEval-2017), and achieve state-of-the-art results. We compare the performance of these models with popular baseline unsupervised and supervised techniques.

We perform a thorough error analysis and provide insights into why particular contextual embeddings work better than others and into the workings of the different self-attention layers of our top-performing models.

We process the existing benchmark datasets using a B-I-O tagging scheme, thus making them more suitable for building sequence-labeling-based models. We make the processed datasets, our source code, and our trained models publicly available (www.anonymous.com) for the benefit of the research community.

The rest of the paper is organized as follows. Section 2 presents prior work on keyphrase extraction and recent developments in deep contextualized language models. Section 3 details the BiLSTM-CRF architecture. Section 4 describes our experiments. Sections 5 and 6 present experimental results and an attention analysis.

2 Background and Related Work

2.1 Keyphrase extraction

The task of automated keyphrase extraction has attracted attention from researchers for nearly 20 years [Frank et al.1999]. Over this time, researchers have developed a wide array of both supervised and unsupervised techniques. In the supervised setting, keyphrase extraction is treated as a binary classification problem, with annotated keyphrases serving as positive examples and all other phrases as negative examples. Supervised techniques employ a machine learning model to determine if a given candidate phrase is a keyphrase based on textual features such as term frequencies [Hulth2003], syntactic properties [Kim and Kan2009], or location information [Nguyen and Kan2007]. More features can be included with the use of external resources like document citations [Caragea et al.2014] or hyperlinks [Kelleher and Luz2005]. Different classification algorithms have been used, including naive Bayes [Witten et al.2005], decision trees [Turney2000], bagging [Hulth2003], boosting [Hulth et al.2001], neural networks [Lopez and Romary2010], and SVMs [Zhao et al.2011].

Popular unsupervised methods such as TextRank [Mihalcea and Tarau2004], LexRank [Erkan and Radev2004], TopicRank [Bougouin, Boudin, and Daille2013], SGRank [Danesh, Sumner, and Martin2015], and SingleRank [Wan and Xiao2008] directly leverage the graph-based ranking algorithm PageRank [Page et al.1999], combined with other heuristics based on tf-idf scores, word co-occurrence measures, extraction of specific lexical patterns, and clustering. Recently, several works [Mahata et al.2018a, Wang, Liu, and McDonald2015, Mahata et al.2018b] have shown the effectiveness of word embeddings in unsupervised keyphrase extraction.

In the presence of domain-specific data, supervised methods have shown better performance. Unsupervised methods have the advantage of not requiring any training data and can produce results in any domain. However, the assumptions of unsupervised methods do not hold for every type of document. On the other hand, casting keyphrase extraction as a binary classification problem has its disadvantages, since each candidate phrase is labeled independently of all others.

Gollapalli and Li [Gollapalli and Li2016] presented one of the first works to approach keyphrase extraction as a sequence labeling task. They used Conditional Random Fields (CRFs) with many textual features such as tf-idf of the terms, orthographic information, POS tags, and positional information. Alzaidy et al. [Alzaidy, Caragea, and Giles2019] used BiLSTM-CRFs, where the words were represented using fixed word embeddings such as Glove [Pennington, Socher, and Manning2014]. In this paper, we further explore sequence labeling techniques for keyphrase extraction using contextualized word embeddings.

2.2 Contextual Embeddings

Recent research has shown that deep-learning language models trained on large corpora can significantly boost performance on many NLP tasks and be effective in transfer learning [Peters et al.2018, Devlin et al.2019, Radford et al.2018]. All these models optimize the traditional language modeling objective: predict a word given its surrounding context. Therefore, they can assign each token a contextual numerical representation that is a function of the entire input sentence. These contextual word embeddings stand in contrast to the previously popular "fixed" word embeddings learned using models such as Word2Vec [Goldberg and Levy2014], Glove [Pennington, Socher, and Manning2014], and FastText [Joulin et al.2017]. For example, the word 'apple' in the sentences "I bought apples from the farmers market" and "I bought the new Apple iPhone" is used in two different contexts: a "fixed" embedding model assigns it the same numerical representation in both, whereas a contextual model varies the representation based on the sentence.
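To make the contrast concrete, here is a minimal sketch (our own illustration, using the HuggingFace transformers library, which is not part of this paper's experimental setup) that queries BERT for the two occurrences of 'apple':

```python
# Sketch: the same surface word gets different contextual vectors.
# Uses the HuggingFace `transformers` package purely for illustration;
# it is not the experimental setup described in this paper.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def vector_for(sentence, word):
    """Hidden state of the first occurrence of `word` in `sentence`."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    return hidden[tokens.index(word)]

v_fruit = vector_for("i bought an apple at the farmers market", "apple")
v_brand = vector_for("i bought the new apple iphone", "apple")
# Well below 1.0: a fixed embedding would give identical vectors instead.
print(torch.cosine_similarity(v_fruit, v_brand, dim=0).item())
```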

ELMo [Peters et al.2018] uses stacked bidirectional LSTMs with a residual connection to model sentences, with tokens represented using character-level CNNs. BERT [Devlin et al.2019] and OpenAI GPT [Radford et al.2018] do away with LSTMs and instead employ multi-layer Transformers [Vaswani et al.2017] for sentence modeling. The main difference between BERT and OpenAI GPT is that the former uses a bidirectional Transformer while the latter uses a left-to-right Transformer. The OpenAI GPT-2 model [Radford et al.2019] was proposed as a direct successor of the GPT model; it is trained on 10x more data than the original GPT, performs better, and has the additional benefit of working in a zero-shot transfer setting. RoBERTa [Liu et al.2019] was recently published as a replication study of BERT that trains on a new, larger dataset for more iterations and removes the next-sentence prediction objective of the original model. These changes helped RoBERTa achieve state-of-the-art performance on different benchmark datasets, making it comparable in performance to other contextual language models like Transformer XL [Dai et al.2019] and XLNet [Yang et al.2019]. More recent works, like SciBERT [Beltagy, Cohan, and Lo2019], have used the BERT architecture to build domain-specific language models. SciBERT was trained on a corpus of 1.14 million scientific papers, mostly from the computer science and biomedical domains, and has achieved state-of-the-art results on many scientific NLP tasks including NER, document classification, and dependency parsing.

In this work, we combine the benefits of formulating keyphrase extraction as a sequence labeling task with the rich representation of language by contextual embeddings. To our knowledge, this is the first work which attempts a robust evaluation of the performance of a BiLSTM-CRF architecture using contextual embeddings for keyphrase extraction. We also present a thorough comparison of the performance of different word embedding models.

3 Methodology

We approach the problem of automated keyphrase extraction from scholarly articles as a sequence labeling task, which can be formally stated as follows: let $d = \{w_1, w_2, \ldots, w_n\}$ be an input text, where $w_i$ represents the $i$-th token. Assign each $w_i$ in the document one of three class labels $y_i \in \{k_B, k_I, k_O\}$, where $k_B$ denotes that $w_i$ marks the beginning of a keyphrase, $k_I$ means that $w_i$ is inside a keyphrase, and $k_O$ indicates that $w_i$ is not part of a keyphrase.
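As a concrete (hypothetical) illustration of this labeling scheme:

```python
# Hypothetical example of the three-way labeling; B/I/O stand for the
# k_B, k_I, and k_O classes defined above.
tokens = ["We", "study", "keyphrase", "extraction", "from", "scholarly", "articles"]
labels = ["O",  "O",     "B",         "I",          "O",    "B",         "I"]
# "keyphrase extraction" and "scholarly articles" are the two keyphrases.
assert len(tokens) == len(labels)
```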

In this paper, we employ a BiLSTM-CRF architecture to solve this sequence labeling problem. LSTMs [Gers, Schmidhuber, and Cummins1999] are recurrent neural networks that deal with vanishing and exploding gradients through gated architectures. Bidirectional LSTMs (BiLSTMs) are a generalization of LSTMs that capture long-distance dependencies between words in both directions.

We first map each token $w_i$ in the input text to a fixed-size dense vector $x_i$; thus $d$ is represented as a sequence of vectors $x_1, x_2, \ldots, x_n$. The corresponding class labels are $y_1, y_2, \ldots, y_n$, where $y_i \in \{k_B, k_I, k_O\}$. We then use a BiLSTM to encode sequential relations between the word representations. An LSTM unit consists of four main components: input gate ($i_t$), forget gate ($f_t$), memory cell ($c_t$), and output gate ($o_t$), which are defined as follows:

$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$    (1)
$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$    (2)
$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c)$    (3)
$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$    (4)
$h_t = o_t \odot \tanh(c_t)$    (5)

In the above equations, $\sigma$ denotes the sigmoid function, $\tanh$ the hyperbolic tangent function, and $\odot$ the element-wise product. $W$, $U$, and $b$ are model parameters that are estimated during training, and $h_t$ is the hidden state.

Figure 1: BiLSTM-CRF architecture

In a bidirectional LSTM, we apply equations 1 to 5 in both directions to create two hidden state vectors $\overrightarrow{h_i}$ and $\overleftarrow{h_i}$, where $\overrightarrow{h_i}$ provides a representation for word $w_i$ by incorporating information from the preceding words $w_1, \ldots, w_{i-1}$, and $\overleftarrow{h_i}$ builds a representation for word $w_i$ by capturing information from the succeeding words $w_{i+1}, \ldots, w_n$. By concatenating $\overrightarrow{h_i}$ and $\overleftarrow{h_i}$, we get a vector representation of $w_i$ in the context of the input text $d$:

$h_i = [\overrightarrow{h_i}; \overleftarrow{h_i}]$    (6)

We then apply an affine transformation to map the output from the BiLSTM to the class space:

$s_i = W_s h_i + b_s$    (7)

where $W_s$ is a matrix of size $k \times 2\ell$ ($k$ being the number of class labels and $\ell$ the LSTM hidden size) and $b_s$ is a bias vector.

The score outputs from the BiLSTM serve as input to a CRF layer. CRFs [Lafferty, McCallum, and Pereira2001] are discriminative probabilistic models that have been used in many sequence tagging problems in NLP. When used in conjunction with deep learning models [Huang, Xu, and Yu2015], CRFs have been shown to improve the performance of many sequence labeling tasks.

In a CRF, the score of an output label sequence $y = (y_1, \ldots, y_n)$ is:

$\mathrm{score}(x, y) = \sum_{i=1}^{n} s_{i, y_i} + \sum_{i=0}^{n} A_{y_i, y_{i+1}}$    (8)

where $A$ is a transition matrix in which $A_{j,k}$ represents the transition score from class $j$ to class $k$. The likelihood of a label sequence is obtained by exponentiating the score and normalizing over all possible output label sequences:

$P(y \mid x) = \dfrac{e^{\mathrm{score}(x, y)}}{\sum_{y'} e^{\mathrm{score}(x, y')}}$    (9)

During inference, CRFs use the Viterbi algorithm to efficiently find the optimal sequence of labels. The entire architecture is summarized in Figure 1. As a baseline, we also experiment with a plain BiLSTM architecture.
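A condensed sketch of this architecture in PyTorch is shown below. It relies on the third-party pytorch-crf package for the CRF layer and is our own illustration rather than the authors' released implementation; the hidden size follows Section 4, everything else is an assumption.

```python
# Minimal BiLSTM-CRF sketch in PyTorch, mirroring Figure 1. The CRF layer
# comes from the third-party `pytorch-crf` package; this is our own
# illustration, not the authors' released code.
import torch
import torch.nn as nn
from torchcrf import CRF

class BiLSTMCRF(nn.Module):
    def __init__(self, embed_dim, hidden_dim=128, num_tags=3):
        super().__init__()
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            bidirectional=True, batch_first=True)
        # Affine map from concatenated hidden states to the tag space
        # (equation 7): a (num_tags x 2*hidden_dim) matrix plus bias.
        self.proj = nn.Linear(2 * hidden_dim, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, embeddings, tags=None, mask=None):
        # embeddings: (batch, seq_len, embed_dim) pre-computed, frozen
        # contextual vectors (e.g. from BERT or SciBERT).
        h, _ = self.lstm(embeddings)          # (batch, seq_len, 2*hidden)
        emissions = self.proj(h)              # (batch, seq_len, num_tags)
        if tags is not None:                  # training: neg. log-likelihood
            return -self.crf(emissions, tags, mask=mask)
        return self.crf.decode(emissions, mask=mask)  # inference: Viterbi
```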

4 Experiments

Datasets - We ran our experiments on three different publicly available keyphrase extraction datasets: Inspec [Hulth2003], SemEval-2010 [Kim et al.2010] (hereafter, SE-2010), and SemEval-2017 [Augenstein et al.2017] (hereafter, SE-2017). Inspec consists of abstracts from 2000 scientific articles divided into train, validation and test sets containing 1000, 500, and 500 abstracts respectively. Each abstract is accompanied by two sets of human-annotated keyphrases: controlled - as assigned by the authors, and uncontrolled - assigned by the readers. Controlled keyphrases are mostly abstractive, i.e., not present in the abstracts, whereas the uncontrolled keyphrases are mostly extractive. SE-2010 consists of 284 full length ACM articles divided into train, trial and test splits containing 144, 40, and 100 articles respectively. SE-2010 also has author-assigned (controlled) and reader-assigned (uncontrolled) keyphrases. SE-2017 consists of 500 open access articles published in ScienceDirect divided into train, dev and test sets containing 350, 50, and 100 articles respectively. Unlike the other two datasets, SE-2017 provides location spans for all keyphrases, i.e. all keyphrases are extractive.

Because we are modeling keyphrase extraction as a sequence labeling task, we only consider extractive keyphrases that are present in the article abstracts in each data set. For Inspec and SE-2010, we automatically identified the location spans for each extractive keyphrase. We discarded the full text articles from SE-2010 and SE-2017 due to memory constraints during training and inference with the contextual embedding models on large documents. We also discarded the trial dataset from SE-2010 and instead randomly chose 14 documents from the train split for validation.
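A simplified sketch of this preprocessing (our own illustration; the paper does not specify its exact tokenization and span-matching rules), covering both span identification and the B-I-O conversion described in the next paragraph:

```python
# Sketch: locate extractive keyphrases in an abstract and emit B-I-O tags.
# A simplified stand-in for the preprocessing described in the text; the
# paper's exact tokenization and span-matching rules are not specified.
def tag_abstract(tokens, keyphrases):
    tags = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    for phrase in keyphrases:
        words = phrase.lower().split()
        for i in range(len(lowered) - len(words) + 1):
            if lowered[i:i + len(words)] == words:
                tags[i] = "B"
                for j in range(i + 1, i + len(words)):
                    tags[j] = "I"
    return tags

tokens = "We present a discrete-event simulation package .".split()
print(tag_abstract(tokens, ["discrete-event simulation package"]))
# -> ['O', 'O', 'O', 'B', 'I', 'I', 'O']
```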

All the tokens in each dataset were tagged using the B-I-O tagging scheme described in the problem statement in the previous section. We plan to release this processed dataset along with this publication. Table 1 provides some general statistics on the processed dataset used in this paper.

Dataset                  SE-2010                Inspec                 SE-2017
                         Train   Dev    Test    Train   Dev    Test    Train   Dev    Test
# Docs                   130     14     100     1000    500    500     350     50     100
Avg # Keyphrases         9.95    9.71   9.84    9.81    9.18   9.74    15.62   19.54  17.69
Max Len of Keyphrases    6       6      6       8       8      9       27      19     30
Avg Len of Keyphrases    1.72    1.84   1.73    2.15    2.15   2.15    3.16    2.72   2.54
Avg # Tokens             185     201    207     142     133    135     187     225    220
Max # Tokens             432     416    395     557     330    384     350     399    389
Min # Tokens             55      111    65      15      16     23      65      137    102
Table 1: General statistics of the processed dataset used in our experiments

Experimental Settings - One of the main aims of this work is to study the effectiveness of contextual embeddings in keyphrase extraction. To this end, we use the BiLSTM-CRF and BiLSTM architectures with seven different pre-trained contextual embeddings: BERT (small-cased, small-uncased, large-cased, large-uncased), SciBERT (basevocab-cased, basevocab-uncased, scivocab-cased, scivocab-uncased), OpenAI GPT, ELMo, RoBERTa (base, large), Transformer XL, and OpenAI GPT-2 (small, medium). As a baseline, we also use 300-dimensional fixed embeddings from Glove (https://nlp.stanford.edu/projects/glove/), Word2Vec (https://github.com/mmihaltz/word2vec-GoogleNews-vectors), and FastText (common-crawl, wiki-news; https://fasttext.cc/docs/en/english-vectors.html). We also compare the proposed architecture against four popular baseline keyphrase extraction techniques: SGRank, SingleRank, TextRank, and KEA. Of these, the first three are unsupervised while KEA is a supervised technique.

We trained the BiLSTM-CRF and BiLSTM models using Stochastic Gradient Descent (SGD) with Nesterov momentum in batched mode. Due to system memory constraints, the batch size was set to 4. The learning rate was set to 0.05 and the models were trained for a total of 100 epochs with a patience value of 4 and an annealing factor of 0.5; i.e., if model performance did not improve for 4 epochs, the learning rate was reduced by a factor of 0.5. The hidden layers in the BiLSTM models were set to 128 units and word dropout was set to 0.05. The token representations obtained using the different embeddings are not tuned during the training process. During inference, we run the model on a given abstract and identify keyphrases as all token sequences that begin with the tag $k_B$ followed by zero or more tokens tagged $k_I$ (a sketch of this decoding appears below). As in previous studies [Kim et al.2010], we use Precision, Recall, and F1-measure based on exact matches against the ground truth for evaluating the different approaches, and use the F1-measure to compare models (due to space constraints, we report only the F1-measure in the paper).
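A minimal sketch of this decoding step and the exact-match F1 computation (our own illustration, using the B/I/O tags from Section 3):

```python
# Sketch: recover keyphrases from a predicted tag sequence and score them
# by exact match, as described above. Our illustration, not released code.
def decode_keyphrases(tokens, tags):
    """Collect maximal runs that start with B and continue with I."""
    phrases, current = [], []
    for tok, tag in zip(tokens, tags):
        if tag == "B":
            if current:
                phrases.append(" ".join(current))
            current = [tok]
        elif tag == "I" and current:
            current.append(tok)
        else:
            if current:
                phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases

def f1_score(predicted, gold):
    """Exact-match F1 between predicted and ground-truth keyphrase sets."""
    tp = len(set(predicted) & set(gold))
    if tp == 0:
        return 0.0
    precision, recall = tp / len(set(predicted)), tp / len(set(gold))
    return 2 * precision * recall / (precision + recall)
```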

5 Results

In this section, we report the main observations for the different experiments that we performed. For each embedding model we report results for the best performing variant of that model (e.g. cased vs uncased) on each dataset. Overall, BERT and SciBERT models are the best-performing embedding models, and the BiLSTM-CRF architecture is the best architecture, consistently across all the datasets.

5.1 Architectures

CRF layer - Table 2 presents a comparison of the BiLSTM and BiLSTM-CRF architectures in terms of F1-scores for three models: SciBERT, BERT, and ELMo. The addition of the CRF layer improved performance on all datasets; it was most effective for ELMo. For example, on the SE-2010 data with ELMo, the CRF layer increased the F1-score by nearly 50%, from 0.157 to 0.225. An analysis of results on the SemEval-2017 data shows that the CRF layer is more effective in capturing keyphrases that include prepositions (e.g. 'of'), conjunctions (e.g. 'and'), and articles (e.g. 'the'). We also observed that the CRF layer is more accurate with longer keyphrases (more than two tokens).

SciBERT       Inspec  SE-2010  SE-2017
BiLSTM-CRF    0.593   0.357    0.521
BiLSTM        0.536   0.301    0.455

BERT          Inspec  SE-2010  SE-2017
BiLSTM-CRF    0.591   0.330    0.522
BiLSTM        0.501   0.295    0.472

ELMo          Inspec  SE-2010  SE-2017
BiLSTM-CRF    0.568   0.225    0.504
BiLSTM        0.457   0.157    0.428

Table 2: BiLSTM vs BiLSTM-CRF (F1-score)

Fine-tuning - Contextualized embedding models can be used in two ways: (1) they can serve as numerical representations of words that are to be used in downstream architectures, or (2) they can be fine-tuned to be optimized for a specific task. Fine-tuning typically involves adding an untrained layer at the end and then optimizing the layer weights for the task-specific objective. We fine-tuned our best-performing contextualized embedding models (BERT and SciBERT) for each dataset and compared with the performance of the corresponding BiLSTM-CRF models when used with the same pre-trained embeddings. The results are summarized in Table 3. The BiLSTM-CRF outperforms contextual embedding model fine-tuning across all datasets, for both BERT and SciBERT. We think this might be due to the small sizes of the datasets on which the models are fine-tuned.

BERT          Inspec  SE-2010  SE-2017
Fine-tuning   0.474   0.236    0.270
BiLSTM-CRF    0.591   0.330    0.522

SciBERT       Inspec  SE-2010  SE-2017
Fine-tuning   0.488   0.268    0.339
BiLSTM-CRF    0.593   0.357    0.521

Table 3: Fine-tuning vs pre-trained (F1-score)
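For reference, fine-tuning in the sense compared above can be sketched with HuggingFace's token-classification head; this is our assumption for illustration, as the paper does not specify its exact fine-tuning setup.

```python
# Sketch: fine-tuning BERT directly for B-I-O token classification,
# the alternative compared against the BiLSTM-CRF in Table 3. Uses
# HuggingFace `transformers`; the paper's exact setup is not specified.
import torch
from transformers import BertForTokenClassification, BertTokenizerFast

model = BertForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=3)  # k_B, k_I, k_O
tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")

enc = tokenizer("keyphrase extraction from scholarly articles",
                return_tensors="pt")
# Hypothetical all-O labels, one per wordpiece; in real training, special
# tokens are masked out with the ignore index -100.
labels = torch.zeros_like(enc["input_ids"])
loss = model(**enc, labels=labels).loss  # minimized during fine-tuning
```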

5.2 Contextual embeddings

Here we want to understand which of the various contextual embedding models discussed in Section 2 is best suited for this task. These models vary in architecture, training data, and hyperparameter choices. Table 4 presents the performance of the best variant (in our experiments) of the seven contextual embedding models and three fixed embedding models, using the BiLSTM-CRF architecture.

Of the ten embedding architectures, BERT or BERT-based models consistently obtained the best performance across all datasets. This was expected, considering that BERT uses bidirectional pre-training, which is more powerful. SciBERT was consistently one of the top-performing models and was significantly better than any of the other models on SemEval-2010. Further analysis of the results on SemEval-2010 shows that SciBERT was more accurate than other models in capturing keyphrases that contain scientific terms such as chemical names (e.g. 'Magnesium', 'Hydrozincite'), software projects (e.g. 'HemeLB'), and abbreviations (e.g. 'DSP', 'SIMLIB'). SciBERT was also more accurate with keyphrases containing more than three tokens. The difference in training vocabulary is apparent where SciBERT is able to correctly classify scientific nouns (e.g. 'real time', 'chip', 'transform').

                Inspec  SE-2010  SE-2017
SciBERT         0.593   0.357    0.521
BERT            0.591   0.330    0.522
ELMo            0.568   0.225    0.504
Transformer-XL  0.521   0.222    0.445
OpenAI-GPT      0.523   0.235    0.439
OpenAI-GPT2     0.531   0.240    0.439
RoBERTa         0.595   0.278    0.508
Glove           0.457   0.111    0.345
FastText        0.524   0.225    0.426
Word2Vec        0.473   0.208    0.292
Table 4: Embedding models comparison (F1-score)

Contextual embeddings outperformed their fixed counterparts for most of the experimental scenarios. The only exception was on SemEval-2010 where FastText outperformed Transformer-XL. Of the three fixed embedding models studied in this paper, FastText obtained the best performance across all datasets.

We also compare the training loss (Figure 2) and validation loss (Figure 3) of BiLSTM-CRF models on four embeddings: SciBERT, BERT, FastText, and Word2Vec. These figures show that the loss values decreased faster for the contextual models than for the fixed embeddings. Also, SciBERT's validation loss converged faster than BERT's. This is consistent with findings in the transfer learning literature [Goldberg2017], which has shown that pre-trained contextual models with rich linguistic information can easily adapt to new domains.

Figure 2: Training loss curve of BiLSTM-CRF using BERT and SciBERT vs fixed embeddings for Inspec.
Figure 3: Validation loss curve of BiLSTM-CRF using BERT and SciBERT vs fixed embeddings for Inspec.

5.3 Baseline comparisons

Lastly, we compare our best performing model (BiLSTM-CRF with SciBERT embeddings) against four baseline methods: SGRank, SingleRank, TextRank, and KEA. Table 5 presents the results. As expected, our model significantly outperforms all the baseline methods on all three datasets. Of the four baseline methods evaluated here, SGRank achieved the best results. This observation holds true irrespective of the embeddings used: for example, on the Inspec data, even the worst-performing BiLSTM-CRF model (Glove at 0.457) is significantly better than the best-performing baseline (SGRank at 0.271).

For SemEval-2017, the best reported F1-score is 0.55 [Ammar et al.2017]. Ammar et al.'s work makes use of keyphrase type information in its modeling (whether a phrase is a task, material, or process) and also employs some task-specific gazetteers, and is therefore not directly comparable to our results. Likewise, to our knowledge, the best reported F1-score for SemEval-2010 is 0.29 [Mahata et al.2018a], but those models use the entire articles. Though these numbers are not directly comparable to our work, we report them here to provide complete context. For Inspec, Mahata et al. [Mahata et al.2018a] reported an F1-score of 0.52, which is significantly outperformed by our best model at 0.59.

            Inspec  SE-2010  SE-2017
SGRank      0.271   0.229    0.211
SingleRank  0.123   0.142    0.155
TextRank    0.122   0.147    0.157
KEA         0.137   0.202    0.129
BiLSTM-CRF  0.593   0.357    0.521
Table 5: Comparison with baseline methods (F1-score)

6 Case study: Attention Analysis

Attention analysis is used to understand whether the attention mechanisms internal to a neural network provide any insight into the linguistic properties learned by the model. We present a case study of attention analysis for keyphrase extraction on a randomly chosen abstract from SemEval-2017.

SciBERT vs BERT (the two columns of the original table show the same abstract with each model's predictions highlighted; the highlighting does not survive in plain text, so the abstract appears once below):

An object-oriented version of SIMLIB -LRB- a simple simulation package -RRB- This paper introduces an object-oriented version of SIMLIB -LRB- an easy-to-understand discrete-event simulation package -RRB- . The object-oriented version is preferable to the original procedural language versions of SIMLIB in that it is easier to understand and teach simulation from an object point of view . A single-server queue simulation is demonstrated using the object-oriented SIMLIB

Table 6: SciBERT vs BERT: keyphrase identification

Table 6 presents the classification results on this abstract from the BERT and SciBERT models; true positives are marked in green and false negatives in red. Using BertViz [Vig2019], we analyzed the aggregated attention of all 12 layers of both models. We observed that keyphrase tokens ($k_B$ and $k_I$) typically pay the most attention to other keyphrase tokens. In contrast, non-keyphrase tokens ($k_O$) usually pay uniform attention to their surrounding tokens. We found that both BERT and SciBERT exhibit similar attention patterns in the initial and final layers, but they vary significantly in the middle layers. For example, Figure 4 compares the attention patterns in the fifth layer of both models. In SciBERT, the token 'object' is very strongly linked to other tokens from its keyphrase, whereas the attentions are comparably weaker in BERT.

We also observed that keyphrase tokens paid strong attention to similar tokens from other keyphrases. For example, as shown in Figure 5, the token 'version' from 'object-oriented version' pays strong attention to 'versions' from 'procedural language versions'. This is a possible reason why both models fail to identify the third mention of 'object-oriented version' in the abstract as a keyphrase. We observed similar patterns in many other documents through our attention analysis. In future work, we plan to quantify this analysis over multiple documents.

Figure 4: Attention Comparison in Middle Layers
Figure 5: Inadvertent similar token attention in SciBERT
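The raw attention tensors behind this kind of analysis can be pulled out as follows (a sketch using the transformers library; the paper itself used the BertViz tool for visualization):

```python
# Sketch: inspecting self-attention weights directly. The paper used the
# BertViz tool; this shows the underlying tensors via `transformers`.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased",
                                  output_attentions=True)
model.eval()

enc = tokenizer("An object-oriented version of SIMLIB", return_tensors="pt")
with torch.no_grad():
    attentions = model(**enc).attentions  # 12 layers of (1, heads, seq, seq)
layer5 = attentions[4][0].mean(dim=0)     # head-averaged 5th-layer attention
tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
# Row i shows how much token i attends to every other token.
print(tokens, layer5.shape)
```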

7 Conclusions

In this paper, we formulate keyphrase extraction as a sequence labeling task solved using BiLSTM-CRFs, where the underlying words are represented using various contextualized embedding models. Through our experimental work, we quantify the benefits of this architecture over directly fine-tuning the embedding models. We also demonstrate how contextual embeddings significantly outperform their fixed counterparts in keyphrase extraction, with BERT-based models performing the best. We also performed attention analysis on a sample scientific abstract to build an intuitive understanding of the workings of the self-attention layers of BERT and SciBERT.

Our approach deals only with keyphrase extraction, not generation. In the future, we plan to use some of the findings from this paper to help build keyphrase generation models. It would also be beneficial to examine the workings of the self-attention layers in greater detail and to study the fine-tuning capabilities of contextual embedding models on bigger datasets for the task of keyphrase extraction. We also expect some of our findings to prove useful for other NLP problems such as document summarization.

References