Keyphrase extraction (KE) is the task of automatically extracting descriptive phrases or concepts that represent the main topics of a document. Keyphrases associated with a document have been shown to improve many natural language processing and information retrieval tasks. Due to their importance, numerous approaches to KE have been proposed along two lines of research: supervised and unsupervised.
In supervised approaches, a candidate phrase is encoded with a set of features, such as its tf-idf score, position in the document, or part-of-speech tag, and machine learning algorithms are trained to classify it as either positive (i.e., a keyphrase) or negative (i.e., a non-keyphrase). The supervised line of research has mainly been concerned with designing hand-crafted features that may yield classifiers that more accurately predict the keyphrases of a document. For example, Hulth (2003) used a combination of lexical and syntactic features, such as the collection frequency and the part-of-speech tag of a phrase; Medelyan, Frank, and Witten (2009) designed features to include the spread of a phrase and Wikipedia-based keyphraseness. Although these features have been shown to work well in practice, many of them are computed from statistical information collected from the training documents, which makes them less suited to documents outside that domain. Thus, it is essential to discover such features or representations automatically, without relying on feature engineering, which normally does not generalize well.
An intuitive approach to replacing the process of feature design is to use existing neural language models [Mikolov, Yih, and Zweig2013]. Learning word representations from large collections of text is an efficient way to capture rich semantic and syntactic information about words. However, even though these representations may carry a lot of information, they are not effective enough when used directly for KE. Like hand-crafted features, which are built on various observations about documents, a feature learning process should exploit the structure of the text itself to discover patterns that keyphrases exhibit. Recently, Perozzi, Al-Rfou, and Skiena (2014) introduced a generalization of neural language models that allows for the exploration of a network structure through a stream of short random walks. Applying a similar line of reasoning to word graphs built from natural language documents results in representations that capture characteristics of words in relation to the topology of the document.
Inspired by this work, we propose SurfKE, a supervised approach to KE that represents a document as a word graph and learns feature representations of graph nodes. Our experiments on a synthesized dataset show that: (1) our approach has the potential to automatically learn informative features for KE; (2) our model that uses its self-discovered features obtains better results than strong baselines.
Our model involves three essential steps, as detailed below.
Graph Construction. Let $d$ be a target document for extracting keyphrases. We build an undirected word graph $G = (V, E)$, where each unique word that passes the part-of-speech filters corresponds to a vertex $v_i \in V$. Two vertices $v_i$ and $v_j$ are linked by an edge $(v_i, v_j) \in E$ if the words corresponding to these vertices co-occur within a window of $w$ contiguous tokens in $d$. The weight of an edge (denoted as $w_{ij}$) is computed based on the co-occurrence count of the two words within a window of $w$ successive tokens in $d$.
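The graph construction step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `build_word_graph`, the default window size, and the tag set used as the part-of-speech filter are all assumptions, since the text does not specify them.

```python
from collections import defaultdict


def build_word_graph(tokens, window=3, keep=None):
    """Build an undirected, weighted co-occurrence graph.

    tokens: list of (word, pos_tag) pairs in document order.
    window: number of contiguous tokens that define co-occurrence (assumed value).
    keep:   POS tags allowed as vertices (hypothetical filter, not from the paper).
    """
    if keep is None:
        keep = {"NOUN", "ADJ"}
    graph = defaultdict(lambda: defaultdict(int))
    for i, (word_i, tag_i) in enumerate(tokens):
        if tag_i not in keep:
            continue
        # Pair the current word with the later tokens inside the same window.
        for word_j, tag_j in tokens[i + 1 : i + window]:
            if tag_j not in keep or word_j == word_i:
                continue
            # Undirected edge: store the co-occurrence count in both directions.
            graph[word_i][word_j] += 1
            graph[word_j][word_i] += 1
    # Return plain dicts so that lookups never create vertices by accident.
    return {word: dict(neighbors) for word, neighbors in graph.items()}


tokens = [
    ("graph", "NOUN"),
    ("based", "VERB"),
    ("keyphrase", "NOUN"),
    ("extraction", "NOUN"),
    ("graph", "NOUN"),
]
word_graph = build_word_graph(tokens, window=3)
```

Here the edge weight is simply the raw co-occurrence count; the filtered-out token still occupies a position in the window, which is one of several reasonable readings of the description.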
Learning Feature Representations. Let $G = (V, E)$ be an undirected graph constructed as above. Our goal is to learn a mapping function $f: V \rightarrow \mathbb{R}^{k}$, where $k$ is a parameter specifying the number of dimensions for the embeddings. This function represents the latent representations associated with each vertex $v_i \in V$. Similar to neural language models, the network feature learning model needs a corpus and a vocabulary in order to learn a mapping of nodes to a low-dimensional space of features. We define our vocabulary as the set of graph vertices, where the graph is built from the target document. To generate a corpus for our algorithm, we leverage a biased sampled random walk strategy. More precisely, let us consider an arbitrary node $v \in V$ and $W_v = (c_0, c_1, \ldots, c_{l-1})$ a biased random walk of length $l$ starting at vertex $v$ ($c_0 = v$). Each node $c_t$, $t = 1, \ldots, l-1$, is generated by the following transition probability distribution:

$$P(c_t = v_j \mid c_{t-1} = v_i) = \frac{w_{ij}}{Z}, \quad (v_i, v_j) \in E,$$

where $w_{ij}$ is the weight of edge $(v_i, v_j)$ and $Z$ is the sum of weights over all edges in the graph. Hence, for each node $v \in V$, we sample a fixed number of biased random walks with respect to the weight of edges and use them as contexts to learn the distributed representation of words.
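The walk sampling described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the names `biased_random_walk` and `build_corpus` and the default parameter values are hypothetical, and each step is normalized over the edges incident to the current node so that the step probabilities sum to one, which keeps the choice proportional to edge weight.

```python
import random


def biased_random_walk(graph, start, length, rng=random):
    """Sample one walk of at most `length` nodes, stepping in proportion to edge weight."""
    walk = [start]
    while len(walk) < length:
        neighbors = graph.get(walk[-1], {})
        if not neighbors:
            break  # isolated node: the walk cannot continue
        nodes = list(neighbors)
        weights = [neighbors[node] for node in nodes]
        # Heavier edges are proportionally more likely to be followed.
        walk.append(rng.choices(nodes, weights=weights, k=1)[0])
    return walk


def build_corpus(graph, walks_per_node=10, length=5, seed=0):
    """Sample a fixed number of walks from every node; the walks act as 'sentences'."""
    rng = random.Random(seed)
    corpus = []
    for node in graph:
        for _ in range(walks_per_node):
            corpus.append(biased_random_walk(graph, node, length, rng))
    return corpus


# Toy weighted graph stored as nested dicts: graph[u][v] is the edge weight.
toy_graph = {"a": {"b": 2, "c": 1}, "b": {"a": 2}, "c": {"a": 1}}
corpus = build_corpus(toy_graph, walks_per_node=4, length=5, seed=1)
```

The resulting list of walks could then be passed to any skip-gram style embedding trainer, with the graph vertices as the vocabulary.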
Candidate words that have contiguous positions in a document are concatenated into phrases. The feature vector for a multi-word phrase is obtained by taking the mean of the vectors of words constituting the phrase.
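A minimal sketch of this step is shown below. The helper names `form_candidate_phrases` and `phrase_vector` are hypothetical, and the sketch assumes that the word embeddings are available as a plain dictionary from word to a list of floats.

```python
def form_candidate_phrases(tokens, candidates):
    """Concatenate candidate words that occupy contiguous positions into phrases."""
    phrases, current = [], []
    for word in tokens:
        if word in candidates:
            current.append(word)
        elif current:
            # A non-candidate word ends the current run of candidates.
            phrases.append(tuple(current))
            current = []
    if current:
        phrases.append(tuple(current))
    return phrases


def phrase_vector(phrase, embeddings):
    """Feature vector of a phrase: the element-wise mean of its word vectors."""
    vectors = [embeddings[word] for word in phrase]
    dim = len(vectors[0])
    return [sum(vec[k] for vec in vectors) / len(vectors) for k in range(dim)]


tokens = ["the", "word", "graph", "is", "weighted"]
candidates = {"word", "graph", "weighted"}
phrases = form_candidate_phrases(tokens, candidates)

# Toy two-dimensional embeddings, for illustration only.
embeddings = {"word": [1.0, 0.0], "graph": [3.0, 2.0], "weighted": [0.5, 0.5]}
vector = phrase_vector(("word", "graph"), embeddings)
```

For a single-word phrase the mean reduces to the word's own vector, so single words and multi-word phrases share the same feature space.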
Experiments and Results
Datasets and Evaluation Measures. To evaluate the performance of our model, we carried out experiments on a synthesized dataset that includes news articles and research papers from several domains. The synthesized dataset contains 1308 documents collected from two benchmark datasets, DUC [Wan and Xiao2008] and Inspec [Hulth2003], and a set of documents from the MEDLINE/PubMed database. The human-assigned keyphrases available for each dataset were used as the gold standard for evaluation.
We measure the performance of our model by computing Precision, Recall and F1-score. The evaluation metrics were averaged in a 10-fold cross-validation setting where the training and test sets were created at the document level. The model parameters such as the vector dimension or the number of walks per node were estimated on a validation set. We used the Gaussian Naïve Bayes classifier implemented in the scikit-learn library to train our model. Similar to our baselines, we evaluate the top 10 predicted keyphrases returned by the model, where candidates are ranked based on the confidence scores as output by the classifier.
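The top-10 evaluation can be sketched as follows. This is an illustration only: the function name `evaluate_top_k` is hypothetical, and the sketch assumes exact string matching between predicted and gold keyphrases, which the text does not specify.

```python
def evaluate_top_k(scored_candidates, gold, k=10):
    """Precision, recall and F1 of the k highest-scoring candidates.

    scored_candidates: list of (phrase, confidence) pairs from the classifier.
    gold:              set of human-assigned keyphrases for the document.
    """
    ranked = sorted(scored_candidates, key=lambda pair: pair[1], reverse=True)
    predicted = [phrase for phrase, _ in ranked[:k]]
    correct = len(set(predicted) & set(gold))
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy confidence scores, for illustration only.
scored = [
    ("word graph", 0.9),
    ("random walk", 0.8),
    ("document", 0.2),
    ("keyphrase extraction", 0.7),
]
gold = {"word graph", "keyphrase extraction", "feature learning"}
precision, recall, f1 = evaluate_top_k(scored, gold, k=3)
```

In a cross-validation setting, these per-document scores would be averaged over the test documents of each fold and then over the folds.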
Results and Discussion. We compare SurfKE with two supervised models, KEA [Frank et al.1999] and Maui [Medelyan, Frank, and Witten2009], and two unsupervised models, KPMiner [El-Beltagy and Rafea2010] and PositionRank [Florescu and Caragea2017]. KEA is a representative system for keyphrase extraction whose most important features are the frequency and the position of a phrase in a document. Maui is an extension of KEA that includes features such as keyphraseness, the spread of a phrase, and statistics gathered from Wikipedia. KPMiner uses statistical observations to limit the number of candidates and ranks phrases using the tf-idf model in conjunction with a boosting factor that aims to reduce the bias towards single-word terms. PositionRank is an unsupervised graph-based model that incorporates the position information of a word’s occurrences into a biased PageRank to score words, which are later used to score and rank candidate phrases.
Table 1 shows the comparison of SurfKE with the four baselines described above. As the table shows, SurfKE substantially outperforms all baseline approaches on the synthesized dataset. For instance, SurfKE obtains a higher F1-score than the two best-performing baselines, PositionRank and Maui. With relative improvements of 50.68%, 33.33%, 15.18% and 8.91% over KEA, KPMiner, Maui and PositionRank, respectively, the Precision of our model is also well above that of the other models.
Conclusion and Future Work
We propose SurfKE, a supervised model for KE that represents the target document as a word graph and uses a biased sampled random walk model to generate short sequences of nodes, which are later used as contexts to learn the node representations. Our experiments show that (1) our approach has the potential to exploit the word graph to capture the "central" terms that represent the text; (2) SurfKE obtains substantial improvements in performance over strong baselines for KE. In future work, we would like to explore other potential neighborhoods of a node for guiding random walks, e.g., words that appear early in the document or that are more topically relevant to the text.
- [El-Beltagy and Rafea2010] El-Beltagy, S. R., and Rafea, A. 2010. Kp-miner: Participation in semeval-2. In SemEval’10.
- [Florescu and Caragea2017] Florescu, C., and Caragea, C. 2017. Positionrank: An unsupervised approach to keyphrase extraction from scholarly documents. In ACL’17, 1105–1115.
- [Frank et al.1999] Frank, E.; Paynter, G. W.; Witten, I. H.; Gutwin, C.; and Nevill-Manning, C. G. 1999. Domain-specific keyphrase extraction. In IJCAI’99, 668–673.
- [Hulth2003] Hulth, A. 2003. Improved automatic keyword extraction given more linguistic knowledge. In EMNLP’03.
- [Medelyan, Frank, and Witten2009] Medelyan, O.; Frank, E.; and Witten, I. H. 2009. Human-competitive tagging using automatic keyphrase extraction. In EMNLP’09, 1318–1327.
- [Mikolov, Yih, and Zweig2013] Mikolov, T.; Yih, W.-t.; and Zweig, G. 2013. Linguistic regularities in continuous space word representations. In HLT-NAACL, volume 13, 746–751.
- [Perozzi, Al-Rfou, and Skiena2014] Perozzi, B.; Al-Rfou, R.; and Skiena, S. 2014. Deepwalk: Online learning of social representations. In KDD, 701–710.
- [Wan and Xiao2008] Wan, X., and Xiao, J. 2008. Single document keyphrase extraction using neighborhood knowledge. In AAAI’08, 855–860.