Keyphrase Extraction with Span-based Feature Representations

02/13/2020 ∙ by Funan Mu, et al. ∙ Harbin Institute of Technology & Tencent

Keyphrases provide semantic metadata that characterizes a document and offers an overview of its content. Since keyphrase extraction facilitates the management, categorization, and retrieval of information, it has received much attention in recent years. There are three approaches to keyphrase extraction: (i) traditional two-step ranking, (ii) sequence labeling, and (iii) generation using neural networks. The two-step ranking approach relies on feature engineering, which is labor-intensive and domain-dependent. Sequence labeling is not able to handle overlapping phrases. Generation methods (i.e., sequence-to-sequence neural network models) overcome those shortcomings, so they have been widely studied and achieve state-of-the-art performance. However, generation methods cannot utilize context information effectively. In this paper, we propose a novel Span Keyphrase Extraction model that extracts span-based feature representations of keyphrases directly from all the content tokens. In this way, our model obtains a representation for each keyphrase and further learns to capture the interaction between keyphrases in a document to produce better ranking results. In addition, because candidate spans are defined directly over tokens, our model is able to extract overlapping keyphrases. Experimental results on benchmark datasets show that our proposed model outperforms existing methods by a large margin.





Keyphrases are the most important and topical phrases of a given text. They provide a concise summary of a piece of text and facilitate the management, categorization, and retrieval of information. Keyphrase extraction [25] is widely used in real-world applications such as recommendation systems [5] and information retrieval [12].

Several methods have been proposed in previous studies, falling into three main approaches: (i) the traditional two-step ranking method, (ii) the sequence labeling method, and (iii) the generation-based neural network method. A straightforward way to perform keyphrase extraction is to decompose the task into two steps: candidate phrase generation and candidate phrase scoring [25, 17]. In the first step, models generate a list of candidate phrases using n-grams or phrases matching certain part-of-speech patterns. In the second step, each candidate phrase is scored by its probability of being a keyphrase in the given document. These two-step ranking-based methods treat candidate phrases individually, which makes it almost impossible to capture the contextual information of candidates and the interactions between different phrases. Further, existing two-step methods [25, 17] are based on feature engineering, which is labor-intensive and domain-dependent. Another intuitive approach is to regard keyphrase extraction as a sequence labeling task [28]. However, the sequence labeling approach can hardly handle keyphrases with overlapping words.

Figure 1: The overlap phenomenon in keyphrases.

As shown in Figure 1, “weighted ranking algorithm” and “ranking algorithm” are both keyphrases that provide semantic information at different granularities. Unfortunately, sequence labeling methods are not able to extract both of them at the same time. With the development of deep learning, especially sequence-to-sequence methods, generation-based methods [18, 3] have attracted much attention. Admittedly, the generation-based approach can deal with overlapping keyphrases without much labor-intensive feature engineering, but it has two shortcomings. First, generation methods cannot utilize context information effectively or capture the interactions between phrases. Second, generation methods produce the tokens of a keyphrase one by one and thus fail to exploit phrase-level information; in other words, the tokens of a phrase only make sense when they appear as a whole sequence. Therefore, a desirable solution should capture the information within word sequences and take advantage of this span-based information when predicting keyphrases.

In this paper, we take the first step toward keyphrase extraction with a Span Keyphrase Extraction (SKE) model built on span-based feature representations. The proposed SKE model first extracts candidate phrases using certain part-of-speech patterns [11] and records the beginning and ending positions of each candidate phrase as spans. After that, BERT [4] or a recurrent neural network over word vectors is used to represent the high-level concept of each phrase. We call this high-level representation the span-based representation. Afterward, a bidirectional recurrent neural network (e.g., LSTM or GRU) is used to capture the interactions between the span-based representations and obtain higher-level phrase representations, which we then use to classify the candidate phrases. In particular, BERT is pre-trained on a large amount of data and contains language knowledge, positional information, and contextual information, so it is well suited to encoding the phrase representations. The candidate phrase generation allows overlap between phrases, and the ranking process is able to utilize the context information of phrases. In summary, with this design, Span Keyphrase Extraction can obtain the overall semantic meaning of both documents and keyphrases and learn the interactions between phrases. The main contributions are as follows:

  • To the best of our knowledge, Span Keyphrase Extraction is the first attempt to use span-based features for keyphrase extraction. The span-based approach is capable of tackling the overlap problem effectively.

  • The proposed Span Keyphrase Extraction model is capable of utilizing context information. The span-based features of phrases are grounded in the document, and a bidirectional recurrent network (e.g., LSTM or GRU), which can utilize context information, is employed to capture the interactions between phrases on top of the span-based representations.

  • We conduct experiments on five benchmark datasets, comparing our SKE model with strong baseline models. Experimental results show that our proposed model achieves state-of-the-art performance.

Related Work

Obtaining high-quality keyphrases for documents is a classic and challenging problem in natural language processing, which has been widely studied in previous works. The existing methods can be categorized into three groups: two-step ranking approach, sequence labeling approach and generation approach.

The traditional ranking-based method consists of two stages. The first stage acquires a set of candidate phrases with heuristic methods, such as extracting important n-grams [9] or selecting text chunks with certain part-of-speech tags [15, 11]. These candidates ought to cover the correct answers as much as possible, since the coverage heavily affects the final result. The second stage scores the candidate phrases by the probability of being a keyphrase using a classifier or heuristic metrics, and keyphrases are selected from the top-ranked candidates. Witten et al. (2005), Hulth (2003), Medelyan et al. (2009), and Zhang et al. (2019) solve this problem with hand-crafted features and machine learning models. These models are based on feature engineering, which is labor-intensive and domain-dependent. Other researchers turn to unsupervised methods [19, 16, 27]. On the other side, Tomokiyo and Hurst (2003) apply two statistical language models to measure the phraseness and informativeness of phrases. Liu et al. (2011) use a word alignment model, which learns a translation from documents to keyphrases. This approach alleviates the vocabulary-gap problem between source and target to a certain degree but fails to handle semantic meaning.

Another straightforward way is to regard keyphrase extraction as a sequence labeling task. Zhang et al. [28] propose a joint-layer recurrent neural network model to extract keyphrases from tweets. However, this method fails to handle overlapping keyphrases.

With the successful application of the sequence-to-sequence model in machine translation, keyphrase generation methods [18, 3] have received much attention. With the attention mechanism [2], copy mechanism [6], coverage mechanism [23], and review mechanism [3], generation-based models achieve state-of-the-art performance. However, generation methods cannot utilize context information effectively, nor can they exploit phrase-level information.

Our proposed Span Keyphrase Extraction is essentially a two-step ranking-based approach. The main contribution of our model is that the span-based approach can tackle the overlap problem effectively while utilizing context information. Previous two-step methods [25, 17] view each phrase as an instance even when the phrases belong to the same document. In contrast, we regard each document as an instance and use a Bi-LSTM to capture the interactions between keyphrases to get better ranking results. Further, we use BERT [4], a pre-trained language representation model, to take advantage of its abundant language knowledge, positional information, and contextual information.


This section gives the details of our proposed Span Keyphrase Extraction model. First, we define the task of keyphrase extraction. Then, we briefly present the token features. Finally, we describe how to extract span-based feature representations of phrases and how to rank them.

Figure 2: Model Architecture. We first use BERT to extract token features. Then span-based feature representations of phrases are produced based on token features. Another Bi-LSTM is adopted to learn the interactions of phrases.


Given a keyphrase extraction dataset D = {(x^i, Y^i)}, x^i is the i-th document text and Y^i is the keyphrase set of x^i. N and M_i are the number of documents and the number of keyphrases of x^i, respectively. Both the document text and the target keyphrases are token sequences, which can be denoted as x^i = (w_1, ..., w_{l_i}) and y = (w_1, ..., w_{l_y}), where l_i and l_y denote the lengths of the token sequences of x^i and y, respectively.

We denote the beginning and ending position of phrases in the document as spans and define the span-based feature as follows:

Definition 1

[Span-based Feature] Let H = (h_1, ..., h_l) be the features of the tokens in a document. The span of a phrase is the pair (b, e) of its beginning and ending token positions. The span-based features are relations between h_b and h_e.

According to this definition, we apply three relations, i.e., identity, element-wise multiplication, and element-wise difference. Details are described in the Span-based Feature Representations section.

We extract the candidate phrase set of each document. The candidate phrases that also appear in the keyphrase set are the positives, denoted P; the rest of the candidate phrases are the negatives, denoted N. In this way we convert keyphrase extraction into a binary classification task by minimizing the cross-entropy loss

L_ce = - Σ_{c ∈ P} log p(c) - Σ_{c ∈ N} log(1 - p(c)),    (1)

where p(c) is the likelihood of candidate c being a keyphrase.

Hinge loss is another effective objective for optimizing our proposed model:

L_hinge = Σ_{c+ ∈ P} Σ_{c- ∈ N} max(0, γ - s(c+) + s(c-)),    (2)

where s(·) is the predicted score of a candidate and γ is the margin, a hyperparameter.
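As a rough illustration, the two objectives above can be sketched as follows; the pairwise form of the hinge loss is our assumption, since the exact pairing is not spelled out here:

```python
import numpy as np

def ce_loss(p_pos, p_neg):
    """Binary cross-entropy over positive/negative candidates (Equation 1)."""
    p_pos, p_neg = np.asarray(p_pos), np.asarray(p_neg)
    return float(-(np.sum(np.log(p_pos)) + np.sum(np.log(1.0 - p_neg))))

def hinge_loss(s_pos, s_neg, margin=0.5):
    """Pairwise hinge loss (Equation 2): every positive candidate should
    outscore every negative one by at least `margin`."""
    s_pos, s_neg = np.asarray(s_pos), np.asarray(s_neg)
    diffs = margin - s_pos[:, None] + s_neg[None, :]  # all positive/negative pairs
    return float(np.sum(np.maximum(0.0, diffs)))
```
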

Figure 2 gives the details of the Span Keyphrase Extraction model. The bottom of the figure is BERT, which is used to extract token features. The middle part of the figure is the span-based representation module. The top of the figure is the bidirectional recurrent neural network, which is used to learn the interactions between phrases. More details about the procedure are given in Algorithm 1.

Input: the training dataset D; the BERT model B; the part-of-speech tagger T; the part-of-speech pattern R; the Porter stemmer S
1: for each (d, Y) in D do
2:  tag the document: t ← T(d)
3:  extract candidates: C ← match(t, R)
4:  record the spans of C as pairs (b, e)
5:  stem the candidate phrases: C′ ← S(C)
6:  stem the keyphrases: Y′ ← S(Y)
7:  positives P ← C′ ∩ Y′
8:  negatives N ← C′ \ P
9:  compute token features: H ← B(d)
10: run the token-level Bi-LSTM over H
11: for each (b, e) do
12:  build the span-based representation from the states at positions b and e
13: end for
14: compute scores for P and N with H
15: compute the loss and update the parameters
16: end for
Algorithm 1 Training procedure of the proposed model
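The candidate/keyphrase matching in Algorithm 1 can be sketched as follows; a simple lowercasing normalizer stands in for the Porter stemmer here, so the function names and normalization are illustrative only:

```python
# Split candidates into positives and negatives by comparing normalized forms.
def normalize(phrase):
    return " ".join(phrase.lower().split())

def split_candidates(candidates, keyphrases):
    gold = {normalize(k) for k in keyphrases}
    positives = [c for c in candidates if normalize(c) in gold]
    negatives = [c for c in candidates if normalize(c) not in gold]
    return positives, negatives
```
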

Token Features

As mentioned before, we first obtain the token features with BERT, which contains abundant language knowledge, positional information, and contextual information. Given the input, BERT begins by converting the sequence of tokens into a sequence of vectors. Each of these vectors is the sum of a token embedding, a positional embedding that represents the position of the token in the sequence, and a segment embedding that represents whether the token is in the source text or the auxiliary text. We only have source text, so the segment embeddings are the same for all tokens. Then several Transformer [24] layers are applied to get the final representations. Each Transformer layer has two sub-layers. The first sub-layer is a multi-head self-attention mechanism [13, 26], and the second sub-layer is a simple, position-wise fully connected feed-forward network. A residual connection [7] is employed around each of the two sub-layers, followed by layer normalization [1]. By concatenating the outputs of the attention heads, we obtain the final values of the multi-head self-attention:

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O, where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V).

This is followed by a fully connected feed-forward network with GELU [8] activation:

FFN(x) = GELU(x W_1 + b_1) W_2 + b_2.

We use the final hidden output of BERT as the representations of the corresponding tokens.
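The multi-head self-attention sub-layer described above can be sketched in NumPy as follows; the dimensions and random weights are illustrative, not BERT's actual sizes:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, n_heads, Wq, Wk, Wv, Wo):
    """X: (seq_len, d_model); each weight matrix: (d_model, d_model)."""
    seq_len, d = X.shape
    dk = d // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = []
    for h in range(n_heads):
        q, k, v = (M[:, h * dk:(h + 1) * dk] for M in (Q, K, V))
        attn = softmax(q @ k.T / np.sqrt(dk))   # (seq_len, seq_len) attention weights
        heads.append(attn @ v)                  # (seq_len, dk) per-head values
    return np.concatenate(heads, axis=-1) @ Wo  # concatenate heads, project with W^O

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))  # 5 tokens, toy model dimension 16
Wq, Wk, Wv, Wo = (rng.normal(size=(16, 16)) for _ in range(4))
out = multi_head_self_attention(X, n_heads=4, Wq=Wq, Wk=Wk, Wv=Wv, Wo=Wo)
```
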

Figure 3: Sentence pair classification based on BERT. We use the final hidden vector as the aggregate representation.

Span-based Feature Representations

We extract the spans of phrases and use the token features produced by BERT to extract features of phrases. First, a Bi-LSTM takes the tokens' hidden representations as input to obtain local-aware features of the tokens.

We denote the forward representations of the beginning and ending tokens of a span as →h_b and →h_e, and the backward representations as ←h_b and ←h_e. Inspired by Conneau et al. (2017), we concatenate three kinds of vectors as the phrase representation: (i) the identical representations themselves; (ii) their element-wise product; and (iii) their element-wise difference. In this way, we get span-based feature representations of phrases. Then, the representations of the phrases are fed into another Bi-LSTM to learn the interactions between phrases and get better ranking results. By concatenating the forward and backward vectors of this Bi-LSTM, we get the final phrase representation.
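The span feature construction can be sketched as below, assuming `fwd` and `bwd` hold the forward/backward Bi-LSTM states per token; the particular pairing (forward state at the span end, backward state at the span start) is our assumption for illustration:

```python
import numpy as np

def span_features(fwd, bwd, b, e):
    """Concatenate the identical, element-wise product, and element-wise
    difference features for the span (b, e)."""
    u, v = fwd[e], bwd[b]  # forward state at the end, backward state at the start
    return np.concatenate([u, v, u * v, u - v])

fwd = np.arange(12.0).reshape(4, 3)       # 4 tokens, hidden size 3 (toy values)
bwd = np.ones((4, 3))
feat = span_features(fwd, bwd, b=1, e=2)  # feature dimension = 4 * hidden size
```
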


Ranking Phrases

We propose two methods to train the model. The first regards the task as a binary classification problem: a fully connected feed-forward network produces the final representation of each phrase, on which we apply a softmax and use the cross-entropy loss shown in Equation (1). The second regards the task as a ranking problem: a fully connected feed-forward network with sigmoid activation produces the score of each phrase, and Equation (2) gives the detail of the hinge loss.

Dataset Inspec Krapivin NUS SemEval KP20k KP Training KP Validation
#Positives 6.16 2.80 4.87 5.26 2.86 2.86 2.85
#Keyphrases 7.13 3.25 6.31 6.25 3.16 3.14 3.15
Keyphrase coverage by candidates 86.36% 86.14% 77.16% 84.16% 90.50% 90.58% 90.58%
#Candidates 54.43 68.21 101.98 101.18 73.40 76.40 76.40
Ratio of #Candidates to #Positives 8.84 24.38 20.95 19.23 25.65 28.70 25.76
Table 1: Details of candidate phrases and keyphrases (averages per document). The coverage shows that 9.42% to 22.84% of keyphrases are not covered by candidate phrases, which limits the final performance of our model. The ratio of candidates to positives ranges from 8.84 to 28.70, meaning that the classification task is imbalanced.


This section begins by discussing the datasets we experiment on and how we pre-process them, followed by the baselines and evaluation metrics. Then we present the results and analysis.

Method Inspec Krapivin NUS SemEval KP20k
Tf-Idf 0.221 0.313 0.129 0.160 0.136 0.184 0.128 0.194 0.108 0.134
TextRank 0.223 0.281 0.189 0.162 0.195 0.196 0.176 0.187 0.180 0.150
Maui 0.040 0.042 0.249 0.216 0.249 0.268 0.044 0.039 0.273 0.240
KEA 0.098 0.126 0.123 0.134 0.069 0.084 0.025 0.026 0.182 0.167
CopyRNN 0.278 0.341 0.311 0.266 0.334 0.326 0.293 0.304 0.333 0.262
CorrRNN - - 0.318 0.278 0.361 0.335 0.320 0.320 - -
BERT-Base-Pair 0.302 0.340 0.288 0.247 0.382 0.362 0.316 0.330 0.373 0.313
SKE-Base-Rank 0.289 0.321 0.287 0.236 0.389 0.365 0.354 0.337 0.381 0.324
SKE-Base-Cls 0.305 0.342 0.312 0.251 0.395 0.371 0.352 0.342 0.386 0.326
SKE-Large-Rank 0.300 0.334 0.313 0.264 0.400 0.379 0.356 0.351 0.392 0.328
SKE-Large-Cls 0.294 0.334 0.309 0.252 0.403 0.364 0.361 0.358 0.392 0.330
Table 2: Performance on five benchmark datasets. Each dataset column reports F1@5 and F1@10.


Meng et al. [18] collected a large amount of high-quality scientific metadata from various online digital libraries. We denote this dataset as KP. KP contains 567,830 articles, and we use the same splits as Meng et al. [18]: 527,830 articles for training, 20k for validation, and 20k as the test dataset, denoted KP20k.

Following Meng et al. [18], we evaluate the proposed model on four widely adopted scientific publication datasets and the KP20k dataset mentioned above. We take the title and abstract as the document text. In Meng et al. [18], each dataset is also split into training and test sets for the baseline models; we keep the same test sets for a fair comparison. Each dataset is described in detail below.

- Inspec [9]: This dataset provides 2,000 paper abstracts. We adopt the 500 testing papers and their corresponding uncontrolled keyphrases for evaluation.

- Krapivin [10]: This dataset provides 2,304 papers with full-text and author-assigned keyphrases. We selected the first 400 papers in alphabetical order as the testing data.

- NUS [20]: We use both author-assigned and reader-assigned keyphrases and treat all 211 papers as the testing data.

- SemEval-2010 [20]: 288 articles were collected from the ACM Digital Library. 100 articles were used for testing.

- KP20k [18]: 567,830 scientific articles in computer science containing titles, abstracts, and keyphrases; 20k were randomly selected as the test dataset.

We use certain part-of-speech patterns to extract the candidate phrases, inspired by Le et al. [11]. Part-of-speech tagging is performed with the Stanford Log-linear Part-Of-Speech Tagger [22]. The following part-of-speech pattern [11] is used.
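As a sketch of this step, candidate spans can be found by matching a pattern over the tag sequence. The exact pattern of [11] is not reproduced here, so the noun-phrase-like regex below ((JJ)* (NN)+ over Penn Treebank tags) is only an illustrative stand-in:

```python
import re

def extract_candidates(tokens, tags, pattern=r"(JJ\s)*(NN[SP]*\s?)+"):
    """Match a POS pattern over the tag string and return (begin, end, phrase)
    triples with inclusive token indices."""
    tag_str = " ".join(tags) + " "
    spans = []
    for m in re.finditer(pattern, tag_str):
        # Map character offsets in the tag string back to token indices.
        b = tag_str[:m.start()].count(" ")
        e = b + m.group().strip().count(" ")  # inclusive end index
        spans.append((b, e, " ".join(tokens[b:e + 1])))
    return spans
```
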

The intersection of candidate phrases and keyphrases gives the positives, denoted P. The numbers of positives and keyphrases are shown in Table 1. When determining whether two phrases match, we preprocess them with the Porter Stemmer. The coverage shows that 9.42% to 22.84% of keyphrases are not covered by candidate phrases, which limits the final performance of our model. Our candidates cover only 77.16% of the keyphrases of NUS; the main reason is that NUS contains irregular keyphrases assigned by readers. The ratio of candidates to positives ranges from 8.84 to 28.70, meaning that the data is imbalanced for the classification task. The statistics are close across the KP splits, which shows that the split is reasonable.

Baseline Model

We first compare our work with six baseline algorithms: unsupervised algorithms (Tf-Idf and TextRank [19]), two-step supervised algorithms (KEA [25] and Maui [17]), and keyphrase generation algorithms (CopyRNN [18] and CorrRNN [3]). For the first five algorithms, we use the performance reported by Meng et al. [18]; for the last one, the results of Chen et al. [3] are used.

We also compare with the sentence pair classification approach proposed by Devlin et al. [4], shown in Figure 3. The basic setting is the same as in the Token Features section except for the inputs: for sentence pair classification, the input consists of a document and a phrase with different segment embeddings. We use the final hidden vector corresponding to the first input token ([CLS]) as the aggregate representation, and a classification layer determines whether the phrase is a keyphrase of the document. Compared to our model, this method treats each phrase as an instance, which fails to capture the interactions between phrases and takes longer to train. Details of the training time are given in the Training Time section.

Method Training Time F1@5 F1@10 (KP20k)

BERT-Base-Pair 132 0.373 0.313
SKE-Base-Rank 2 0.381 0.324
SKE-Base-Cls 2 0.386 0.326
BERT-Large-Pair - - -
SKE-Large-Rank 7 0.392 0.328
SKE-Large-Cls 7 0.392 0.330
Table 3: The training time of sentence pair classification and our method. Because the sentence pair classification model takes too much time to train on top of BERT-Large, limiting its usage in practical applications, we do not report its training time and performance.
Title: Deployment issues of a voip conferencing system in a virtual conferencing environment.
Abstract: Real time services have been supported by and large on circuitswitched networks. Recent trends favour services ported on packet switched network. For audio conferencing, we need to consider many issues scalability, quality of the conference application, floor control and load on the clients servers to name a few. In this paper, we describe an audio service framework designed to provide a virtual conferencing environment (vce). The system is designed to accommodate a large number of end users speaking at the same time and spread across the internet. The framework is based on conference servers , which facilitate the audio handling, while we exploit the sip capabilities for signaling purposes. Client selection is based on a recent quantifier called loudness number that helps mimic a physical face to face conference. We deal with deployment issues of the proposed solution both in terms of scalability and interactivity, while explaining the techniques we use to reduce the traffic. We have implemented a conference server (cs) application on a campus wide network at our institute.
CorrRNN: voip; virtual conferencing; voip conferencing; audio conferencing; audio service; real time services; real time; distributed systems; conference server; virtual conferencing environment;
SKE-Large-Cls: conference server; voip; virtual conferencing; sip; audio conferencing; deployment; loudness number; scalability; virtual conferencing environment; conferencing;
Table 4: Top-10 phrases provided by CorrRNN and SKE-Large-Cls. Phrases in bold are correct. SKE-Large-Cls provides more keyphrases and better ranking results.

Training Details

Devlin et al. [4] provide two pre-trained models, BERT-Base and BERT-Large, with the same Transformer structure. The main difference between them is the hidden dimension of the Transformer, which leads to different parameter sizes: BERT-Base has 110M parameters while BERT-Large has 340M. We train our model on both uncased BERT-Base and uncased BERT-Large. For both models, we use AdamW (Loshchilov and Hutter, 2017) with warmup as the optimizer. Apart from the parameters of BERT, the other parameters are randomly initialized, and all parameters are fine-tuned on the training dataset. For the classification task, different cross-entropy weights are used for the positive and negative phrases according to their ratio; specifically, the weight of the negatives is 1 and the weight of the positives is picked from {10, 15, 20}. For the ranking task, the margin is selected by a grid search in [0, 1]. We use a learning rate of 5e-5 and a warmup proportion of 0.1 for AdamW. The L2 weight decay rate is set to 0.01. We use minibatches of size 128 and train for 5 epochs. All models are trained on a single machine with 8 NVIDIA Tesla P40 GPUs, and the best model is selected across epochs using the KP validation dataset. The random seed is fixed for stable results.

For the training dataset, we only keep the documents in which at least one keyphrase is matched by the certain part-of-speech patterns [11], leaving 499,087 documents for training our proposed model.

This yields 42,375,099 pairs for training the sentence pair classification model, as this method can only process one phrase at a time, which makes its training time dozens of times longer than our model's. Details of the training time are shown in the Training Time section. We only conduct experiments with BERT-Base for sentence pair classification since it is not practical to use BERT-Large in this setting.

Method F1@5 F1@10 (KP20k)
Maui 0.273 0.240
CopyRNN 0.333 0.262
SKE-RNN-Cls 0.339 0.296
Table 5: Results of the RNN word-representation variant of our Span Keyphrase Extraction classification model, named SKE-RNN-Cls, on KP20k. SKE-RNN-Cls achieves 1.8% and 12.9% improvements over CopyRNN on F1@5 and F1@10 respectively.

Results and Analysis

For a fair comparison, the micro-averaged F-measure is adopted, following Chen et al. [3]. By the standard definition, precision is the number of correctly predicted keyphrases over the number of all predicted keyphrases, and recall is the number of correctly predicted keyphrases over the total number of target keyphrases. The F-measure is the harmonic mean of precision and recall.
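The per-document metric can be sketched as follows, with exact string matching standing in for the stemmed matching described earlier:

```python
def f1_at_k(predicted, gold, k):
    """F1 over the top-k predictions against the gold keyphrase list."""
    topk = predicted[:k]
    correct = sum(1 for p in topk if p in set(gold))
    precision = correct / len(topk) if topk else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```
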

Table 2 provides the performance of the seven baseline models as well as our proposed models. For each method, the table lists the F-measure at the top 5 and top 10 results (F1@5 and F1@10). The best scores are highlighted in bold.

The results show that the two unsupervised models (Tf-Idf and TextRank) perform robustly across all datasets, while the performance of the two supervised models (Maui and KEA) is unstable on some datasets. The two generation models outperform the previous baselines by a large margin, indicating the effectiveness of RNNs with a copy mechanism. As no results are reported on Inspec and KP20k in Chen et al. [3], we omit CorrRNN's performance on those datasets in the table. By using the coverage mechanism and review mechanism, CorrRNN beats CopyRNN on the three remaining datasets.

We denote the sentence pair classification method based on BERT-Base as BERT-Base-Pair. BERT-Base-Pair outperforms the generation methods on three datasets. On KP20k, BERT-Base-Pair achieves 12.01% and 19.47% relative gains on F1@5 and F1@10 respectively compared to CopyRNN.

We name our models after the BERT model and the training task; for example, SKE-Base-Cls and SKE-Base-Rank correspond to the classification and ranking methods based on BERT-Base, respectively. The results show that our methods achieve state-of-the-art performance on all datasets except Krapivin. Our proposed method achieves at least a 10% gain on the NUS, SemEval, and KP20k datasets compared to the generation-based methods. Our model also beats the sentence pair classification model on all datasets, indicating the effectiveness of the span-based feature representations.


Title: Motion estimation using modified dynamic programming

Abstract: Correspondence vector-field computation is formulated as a matching optimization problem for multiple dynamic images. The proposed method is a heuristic modification of dynamic programming applied to the 2-D optimization problem. Motion vector field estimates using real movie images demonstrate good performance of the algorithm in terms of dynamic motion analysis.
SKE-Large-Cls: motion estimation; dynamic programming; correspondence vector field; motion vector field; dynamic motion analysis; motion vector field estimates; moving objects; matching optimization; modified dynamic programming; motion analysis;
Title: A Weighted Ranking Algorithm For Facet-Based Component Retrieval System
Abstract: Facet-based component retrieval techniques have been proved to be an effective way for retrieving. These Techniques are widely adopted by component library systems, but they usually simply list out all the retrieval results without any kind of ranking. In our work, we focus on the problem that how to determine the ranks of the components retrieved by user. Factors which can influence the ranking are extracted and identified through the analysis of ER-Diagram of facet-based component library system. In this paper, a mathematical model of weighted ranking algorithm is proposed and the timing of ranks calculation is discussed. Experiment results show that this algorithm greatly improves the efficiency of component retrieval system..
SKE-Large-Cls: component retrieval; facet; weighted ranking; ranks; er diagram; weighted ranking algorithm; component library; component library system; ranking algorithm; component retrieval system;
Table 6: Top-10 phrases provided by SKE-Large-Cls. Phrases in bold are correct. Overlapping keyphrases are in bold italics, and the corresponding text is likewise marked in the document above.

Training Time

We show the training times of sentence pair classification and our method in Table 3. Our models achieve significant performance gains and are 66 times faster than the sentence pair classification method, whose training time grows with the number of document-phrase pairs. Because the sentence pair classification model takes too much time to train on top of BERT-Large, limiting its usage in practical applications, we do not report its training time and performance.

Word Representation with RNN

We also conduct an experiment using GloVe [21] word representations with an RNN instead of BERT. We compare this model with CopyRNN on the largest test dataset, KP20k. The results are shown in Table 5: SKE-RNN-Cls achieves 1.8% and 12.9% improvements over CopyRNN on F1@5 and F1@10 respectively.

Overlapped Keyphrases

To better evaluate our model’s ability to identify overlapping keyphrases, we compute some statistics on the test datasets. There are 16.67% and 19.96% overlapping keyphrases in NUS and SemEval, respectively. The results show that our SKE-Large-Cls model extracts 69.70% and 73.77% of the overlapping keyphrases in these two datasets, which illustrates our model’s ability to extract overlapping keyphrases.


Case Study

We compare the phrases provided by CorrRNN and SKE-Large-Cls on an example article in Table 4. Compared to CorrRNN, SKE-Large-Cls provides two more keyphrases, “sip” and “loudness number”, which cover two important topics. Moreover, SKE-Large-Cls presents a better ranking of keyphrases (i.e., SKE-Large-Cls provides three keyphrases in its top-4 results while CorrRNN provides only one). Despite using the coverage mechanism and review mechanism, CorrRNN still produces consecutive phrases that share the same prefix (e.g., “audio conferencing” and “audio service”, “real time services” and “real time”), showing that the generation method fails to capture the semantic meaning of a phrase as a whole.

We show another example in Table 6, in which SKE-Large-Cls provides two overlapping keyphrases per document. In the first example, “motion vector field estimates” and “motion vector field” are both keyphrases and overlap in the document. The same situation appears in the second example with the keyphrases “weighted ranking algorithm” and “ranking algorithm”. SKE-Large-Cls obtains them all, showing its ability to process overlapping keyphrases, a situation that is hard to handle with the sequence labeling method.

Applicable Task

In this paper, we mainly focus on the keyphrase extraction task, but our proposed method is applicable to other natural language tasks such as named entity recognition. For named entity recognition, an existing method [14] uses a sequence labeling model to tag candidate entities with “B”, “I”, and “O” labels, plus a multi-class multi-label classification model to type each entity, e.g., as an organization or a person. However, sequence labeling can hardly handle overlapping named entities (e.g., “BMW X1” and “BMW” are a car series and a car brand respectively, representing different granularities of information), which can be solved with our method. Furthermore, by changing the candidate selection stage, we can obtain an end-to-end method for keyphrase extraction and named entity recognition: given a document, we can split it by symbols that hardly appear inside an entity, such as commas and periods, and then select all possible spans within the splits as candidates.
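The candidate generation idea in the last sentence could look like the sketch below; the punctuation set and the maximum span length are our assumptions:

```python
import re

def candidate_spans(text, max_len=3):
    """Split on punctuation unlikely to occur inside an entity, then take every
    span (up to max_len tokens) within each segment as a candidate."""
    segments = [s.split() for s in re.split(r"[,.;:!?]", text) if s.strip()]
    spans = []
    for seg in segments:
        for i in range(len(seg)):
            for j in range(i, min(i + max_len, len(seg))):
                spans.append(" ".join(seg[i:j + 1]))
    return spans
```
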

Conclusion and Future Work

The key idea of the Span Keyphrase Extraction model is to build span-based feature representations for keyphrases. The span-based approach tackles overlapping keyphrases effectively. Furthermore, our model utilizes context information through an overall understanding of the document and better models interactions between phrases via Bi-LSTMs. Comprehensive empirical studies on four benchmark datasets demonstrate the effectiveness of our proposed model.

Future work will focus on three aspects: (i) Our model adopts a two-step approach, so it cannot avoid error accumulation. Motivated by this, we are pursuing methods to improve candidate coverage (e.g., more exhaustive part-of-speech patterns, using all spans up to a threshold length as candidate phrases, or using all possible spans within segments split at symbols). (ii) We plan to apply our model to other natural language processing tasks (e.g., named entity recognition and automatic text summarization) that require extracting contiguous text spans for classification or ranking. (iii) Our model treats occurrences of the same phrase at different positions as different phrases and thus fails to exploit this valuable information; it would be interesting to explore a new model structure to address this problem.


  • [1] L. J. Ba, J. R. Kiros, and G. E. Hinton (2016) Layer normalization. CoRR abs/1607.06450. Cited by: Token Features.
  • [2] D. Bahdanau, K. Cho, and Y. Bengio (2015) Neural machine translation by jointly learning to align and translate. In Proc. ICLR, Cited by: Related Work.
  • [3] J. Chen, X. Zhang, Y. Wu, Z. Yan, and Z. Li (2018) Keyphrase generation with correlation constraints. In Proc. EMNLP, pp. 4057–4066. Cited by: Introduction, Related Work, Baseline Model.
  • [4] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. NAACL-HLT, pp. 4171–4186. Cited by: Introduction, Related Work.
  • [5] F. Ferrara, N. Pudota, and C. Tasso (2011) A keyphrase-based paper recommender system. In Digital Libraries and Archives - 7th Italian Research Conference, IRCDL 2011, Pisa, Italy, January 20-21, 2011. Revised Papers, pp. 14–25. Cited by: Introduction.
  • [6] J. Gu, Z. Lu, H. Li, and V. O. K. Li (2016) Incorporating copying mechanism in sequence-to-sequence learning. In Proc. ACL, Volume 1: Long Papers, Cited by: Related Work.
  • [7] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proc. CVPR, pp. 770–778. Cited by: Token Features.
  • [8] D. Hendrycks and K. Gimpel (2016) Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415. Cited by: Token Features.
  • [9] A. Hulth (2003) Improved automatic keyword extraction given more linguistic knowledge. In Proc. EMNLP, Cited by: Related Work, Dataset.
  • [10] M. Krapivin, A. Autaeu, and M. Marchese (2009) Large dataset for keyphrases extraction. Technical report University of Trento. Cited by: Dataset.
  • [11] T. T. N. Le, M. L. Nguyen, and A. Shimazu (2016) Unsupervised keyphrase extraction: introducing new kinds of words to keyphrases. In AI 2016: Advances in Artificial Intelligence - 29th Australasian Joint Conference, Hobart, TAS, Australia, December 5-8, 2016, Proceedings, pp. 665–671. Cited by: Introduction, Related Work, Dataset, Training Details.
  • [12] Q. Li, Y. Wu, R. S. Bot, and X. Chen (2004) Incorporating document keyphrases in search results. In 10th Americas Conference on Information Systems, AMCIS 2004, New York, NY, USA, August 6-8, 2004, pp. 410. Cited by: Introduction.
  • [13] Z. Lin, M. Feng, C. N. d. Santos, M. Yu, B. Xiang, B. Zhou, and Y. Bengio (2017) A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130. Cited by: Token Features.
  • [14] X. Ling and D. S. Weld (2012) Fine-grained entity recognition. In Proc. AAAI, Cited by: Applicable Task.
  • [15] Z. Liu, X. Chen, Y. Zheng, and M. Sun (2011) Automatic keyphrase extraction by bridging vocabulary gap. In Proc. CoNLL, pp. 135–144. Cited by: Related Work.
  • [16] Z. Liu, W. Huang, Y. Zheng, and M. Sun (2010) Automatic keyphrase extraction via topic decomposition. In Proc. EMNLP, pp. 366–376. Cited by: Related Work.
  • [17] O. Medelyan, E. Frank, and I. H. Witten (2009) Human-competitive tagging using automatic keyphrase extraction. In Proc. EMNLP, pp. 1318–1327. Cited by: Introduction, Related Work, Baseline Model.
  • [18] R. Meng, S. Zhao, S. Han, D. He, P. Brusilovsky, and Y. Chi (2017) Deep keyphrase generation. In Proc. ACL, Volume 1: Long Papers, pp. 582–592. Cited by: Introduction, Related Work, Dataset.
  • [19] R. Mihalcea and P. Tarau (2004) Textrank: bringing order into text. In Proc. EMNLP, Cited by: Related Work, Baseline Model.
  • [20] T. D. Nguyen and M. Kan (2007) Keyphrase extraction in scientific publications. In International conference on Asian digital libraries, pp. 317–326. Cited by: Dataset, Dataset.
  • [21] J. Pennington, R. Socher, and C. Manning (2014) Glove: global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543. Cited by: Word Representation with RNN.
  • [22] K. Toutanova, D. Klein, C. D. Manning, and Y. Singer (2003) Feature-rich part-of-speech tagging with a cyclic dependency network. In Proc. HLT-NAACL, Cited by: Dataset.
  • [23] Z. Tu, Z. Lu, Y. Liu, X. Liu, and H. Li (2016) Modeling coverage for neural machine translation. arXiv preprint arXiv:1601.04811. Cited by: Related Work.
  • [24] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: Token Features.
  • [25] I. H. Witten, G. W. Paynter, E. Frank, C. Gutwin, and C. G. Nevill-Manning (2005) Kea: practical automated keyphrase extraction. In Design and Usability of Digital Libraries: Case Studies in the Asia Pacific, pp. 129–152. Cited by: Introduction, Introduction, Related Work, Baseline Model.
  • [26] Q. Yin, Y. Zhang, W. Zhang, T. Liu, and W. Y. Wang (2018) Zero pronoun resolution with attention-based neural network. In Proceedings of the 27th International Conference on Computational Linguistics, pp. 13–23. Cited by: Token Features.
  • [27] F. Zhang, B. Peng, et al. (2013) WordTopic-multirank: a new method for automatic keyphrase extraction. In Proceedings of the Sixth International Joint Conference on Natural Language Processing, pp. 10–18. Cited by: Related Work.
  • [28] Q. Zhang, Y. Wang, Y. Gong, and X. Huang (2016) Keyphrase extraction using deep recurrent neural networks on twitter. In Proc. EMNLP, pp. 836–845. Cited by: Introduction.