Fast Top-k Area Topics Extraction with Knowledge Base

10/13/2017 ∙ by Fang Zhang, et al. ∙ Beihang University, Tsinghua University

What are the most popular research topics in Artificial Intelligence (AI)? We formulate the problem as extracting the top-k topics that best represent a given area with the help of a knowledge base. We theoretically prove that the problem is NP-hard and propose an optimization model, FastKATE, to address this problem by combining both explicit and latent representations for each topic. We leverage a large-scale knowledge base (Wikipedia) to generate topic embeddings using neural networks and use these representations to help capture the representativeness of topics for given areas. We develop a fast heuristic algorithm to efficiently solve the problem with a provable error bound. We evaluate the proposed model on three real-world datasets. Experimental results demonstrate our model's effectiveness, robustness, real-timeness (returning results in <1s), and its superiority over several alternative methods.




1 Introduction

Automatically extracting the top-$k$ topics of a given area is fundamental to the historical analysis of that area. Solving this problem not only gives an accurate overview of the area, but can also help society allocate resources (e.g., research funding) more efficiently, toward more representative and important topics. It can also provide guidance to newcomers to the area. However, almost any area contains too many topics, and it is non-trivial for a researcher to extract the top-$k$ topics of a given area in a short period of time, especially for a newcomer. It is therefore important to find a way to solve this problem automatically.

While much research has been conducted on topic extraction, its main focus is document topic extraction rather than area topic extraction. For example, in [Blei, Ng, and Jordan2003, Griffiths and Steyvers2004], the latent Dirichlet allocation (LDA) model is used to model topics in documents and abstracts, where topics are represented as multinomial distributions over words. Topics can also be represented as keyphrases (or topical phrases); from this perspective, the keyphrase extraction task can also be viewed as a topic extraction task. Different models, such as frequency-based [Salton and Buckley1997], graph-based [Mihalcea and Tarau2004], and clustering-based [Grineva, Grinev, and Lizorkin2009] ones, have been explored to address the keyphrase extraction problem, but they still focus on a document instead of an area.

The problem of area topic extraction is novel and non-trivial, and poses a set of unique challenges: (1) It is not clear how to formulate the problem, what kind of datasets to use, and how to use them. (2) Capturing the representativeness of topics for a given area is another challenging issue. (3) The number of candidate topics in a given area may be very large. There are 14,449,404 page titles (including categories) in Wikipedia, and even after preprocessing we still get 9,355,550 topics. Developing an algorithm efficient enough to apply in practice is therefore important too. (4) Since no standard benchmark perfectly matches this problem, quantitatively evaluating the results is also challenging.

To address these challenges, in this paper, we give a formal definition of the problem and develop an optimization model to efficiently solve it. Our contributions can be summarized as follows:

  • To the best of our knowledge, this is the first attempt to formulate and address the area topic extraction problem. We formulate the problem as extracting the top-$k$ topics that best represent a given area with the help of a knowledge base. We theoretically prove that the problem is NP-hard.

  • We propose an optimization model, FastKATE, to address this problem by combining both explicit and latent representations for each topic. We leverage a large-scale knowledge base (Wikipedia) to generate topic embeddings using neural networks and use this kind of representations to help capture the representativeness of topics for given areas. We develop a fast heuristic algorithm to efficiently solve the problem with a provable error bound.

  • We evaluate the proposed model on three real-world datasets. Experimental results demonstrate our model’s effectiveness, robustness, real-timeness (returning results in <1s), and its superiority over several alternative methods.

2 Problem Formulation

We first provide necessary definitions and then formally define the problem.

Definition 1.

Knowledge Base and Topic. A knowledge base is represented as a triple $KB = (C, R, X)$, where $C$ represents a set of knowledge concepts, which we also view as topics in this paper; $R$ represents a set of relations between topics; and $X$ represents a set of co-existences between topics, i.e., each $X_n \in X$ is a sequence of topics $\{c_1, c_2, \dots, c_{m_n}\}$, where $c_i \in C$.

This definition is a variation of that in [McGuinness, Van Harmelen, and others2004, Tang et al.2015]. In our work, $X$ represents a corpus consisting of a massive number of documents, and each document is a sequence of topics. Relations in $R$ may have various types; we focus on sub-topic and super-topic relations in our work.

Each topic in a knowledge base already has a corresponding topical phrase, such as “Artificial Intelligence”. To help grasp the relations/similarities between these topical phrases, we also represent each topic $c \in C$ as a vector $v_c \in \mathbb{R}^d$ in a latent feature space, where $d$ is the dimension of the feature space, which will be detailed in section 3.1. Thus each topic in our work has both an explicit representation (i.e., a topical phrase) and a latent representation (i.e., a vector).

Definition 2.

Area. In this paper, an area $a$ is essentially also a topic in $C$. Thus it has the same form and attributes as other topics in $C$. An area may also be a topic of some other area. For example, Machine Learning is an area, and it can also be viewed as a topic of the Artificial Intelligence area.

We leverage a knowledge base to help extract topics from a given area in our work. We formally define the problem as follows.

Problem 1.

Extracting top-$k$ topics in a given area.

The input of this problem includes an external knowledge base $KB$, a given area $a$, and the number $k$ of topics to be extracted.

The output of this problem is a set $T$ of top-$k$ topics that best represent the given area.

Our goal is to learn a function $f$ from the given input so as to extract the top-$k$ topics that best represent the given area. More specifically, $f$ is defined as:

$$f: (KB, a, k) \mapsto T = \{c_1, c_2, \dots, c_k\}, \quad c_i \in C$$

This problem is equivalent to selecting $k$ topics from the topic set $C$ that best represent the given area $a$. We use $Q(T, a)$ to denote how well a set of topics $T$ represents all topics in $C$ on the given area $a$. Without loss of generality, we assume $Q(T, a) \geq 0$. Since adding new topics to $T$ should not reduce the representativeness of previously extracted topics, $Q(T, a)$ should be monotonically non-decreasing. We also reasonably assume that topics added in earlier steps do not increase (and may reduce) the contribution that topics added in later steps make to the goal function. In other words, a topic added in a later step contributes no more to the goal function than the same topic would if added in an earlier step. This is intuitive: if a topic is added later, previously added topics may already represent the area well, so this topic’s contribution may decrease. We will show in the following section that this property implies the submodularity of the goal function [Svitkina and Fleischer2011]. Our problem can then be reformulated as follows:

$$T^* = \arg\max_{T \subseteq C,\ |T| = k} Q(T, a)$$

where $Q(T, a)$ is a non-negative and monotonically non-decreasing function.

3 The Proposed Model

We propose FastKATE (Fast top-K Area Topics Extraction) to address the problem. FastKATE not only represents topics in explicit forms (phrases), as in knowledge bases, but also represents topics as vectors in a latent feature space, and uses a neural network-based method to learn topic embeddings from an external large-scale knowledge base. FastKATE further incorporates domain knowledge from the knowledge base to assign “general weights” to different topics. We develop a heuristic algorithm to efficiently solve the defined problem and prove that its solution is at least $(1 - 1/e)$ of the optimal solution. We further develop a fast implementation of our algorithm that returns results in real time.

3.1 Topics Representation

We first generate $C$ and use it as the set of candidate topics, and then train an embedding for each topic $c \in C$. We use Wikipedia as our knowledge base to help generate candidate topics and train topic embeddings. We extract the 14,449,404 titles of all articles and categories from Wikipedia, convert them to lowercase, and remove duplicates and titles consisting of punctuation. This leaves 9,355,550 titles as candidate topics. We then preprocess the Wikipedia corpus to keep only candidate topics in it, and use the preprocessed corpus as training data for an unsupervised neural network-based method that learns the topic embeddings. We adopt a method similar to that used in Word2Vec [Mikolov et al.2013]: we treat each topic as a single token and use a Skip-Gram model to generate each topic’s embedding. In the Skip-Gram model, the training objective is to find topic embeddings that are useful for predicting surrounding topics. More formally, given a sequence of training topics $c_1, c_2, \dots, c_M$, the objective of the Skip-Gram model is to maximize the average log probability

$$\frac{1}{M} \sum_{i=1}^{M} \sum_{-w \leq j \leq w,\ j \neq 0} \log p(c_{i+j} \mid c_i)$$

where $w$ is the size of the training context (also denoted as window size), and $p(c_{i+j} \mid c_i)$ is defined using the softmax function:

$$p(c_O \mid c_I) = \frac{\exp\left({v'_{c_O}}^{\top} v_{c_I}\right)}{\sum_{c \in C} \exp\left({v'_{c}}^{\top} v_{c_I}\right)}$$

where $v_c$ and $v'_c$ are the embeddings of $c$ as “input topic” and “output topic” respectively, and $|C|$ is the number of total candidate topics. Because $|C|$ is very large, this calculation is computationally very expensive. We therefore adopt a common approximation in our model, Negative Sampling (NEG) [Mikolov et al.2013], which greatly speeds up the training process. Using NEG, $\log p(c_O \mid c_I)$ is replaced by:

$$\log \sigma\left({v'_{c_O}}^{\top} v_{c_I}\right) + \sum_{i=1}^{K} \mathbb{E}_{c_i \sim P_n(c)} \left[ \log \sigma\left(-{v'_{c_i}}^{\top} v_{c_I}\right) \right]$$

where $\sigma(x) = 1/(1 + \exp(-x))$, $P_n(c)$ is the noise distribution of topics, and $K$ is the number of negative samples of each topic. Thus the task is to distinguish the target topic $c_O$ from $K$ draws from the noise distribution $P_n(c)$. We also subsample frequent topics [Mikolov et al.2013] to counter the imbalance between rare and frequent topics: each topic $c_i$ in the training set is discarded with probability computed by the formula:

$$P(c_i) = 1 - \sqrt{\frac{t}{f(c_i)}}$$

where $f(c_i)$ is the frequency of topic $c_i$ and $t$ is a chosen threshold, typically around $10^{-5}$.
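The two per-topic quantities used during training can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names are hypothetical, vectors are plain tuples, and the negative samples are passed in explicitly rather than drawn from a noise distribution.

```python
import math

def discard_probability(freq: float, threshold: float = 1e-5) -> float:
    """Subsampling: probability of discarding a topic with relative frequency `freq`.

    Implements 1 - sqrt(t / f), clipped at 0 so that rare topics are never discarded.
    """
    return max(0.0, 1.0 - math.sqrt(threshold / freq))

def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigma(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dot(u, v) -> float:
    return sum(a * b for a, b in zip(u, v))

def neg_sampling_objective(v_in, v_out, negatives) -> float:
    """NEG objective for one (input topic, output topic) pair.

    `negatives` holds the output vectors of the sampled noise topics
    (an assumption: in practice they are drawn from the noise distribution).
    """
    positive = log_sigmoid(dot(v_out, v_in))
    negative = sum(log_sigmoid(-dot(v_neg, v_in)) for v_neg in negatives)
    return positive + negative

# A topic that makes up 0.1% of the corpus is dropped most of the time.
print(discard_probability(1e-3))
print(neg_sampling_objective((0.5, 0.1), (0.4, 0.2), [(-0.3, 0.1), (0.0, -0.2)]))
```

The objective is maximized when the input vector aligns with the true output vector and points away from the sampled noise vectors.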

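Input: Area $a$, knowledge base $KB$, the number $k$ of topics to extract, general weights $g(c, a)$ of topics in $C$.
Output: The top-$k$ topics set $T$.
$T \leftarrow \emptyset$;
while $|T| < k$ do
       $max \leftarrow -\infty$;
       foreach $c \in C \setminus T$ do
             $sum \leftarrow 0$;
             foreach $c' \in C$ do
                   $sum$ += $g(c', a) \cdot q(c', T \cup \{c\})$;
             if $sum > max$ then
                   $max \leftarrow sum$; $t \leftarrow c$;
       $T \leftarrow T \cup \{t\}$;
return $T$;
Algorithm 1 Top-$k$ Area Topics Extraction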

3.2 Top-$k$ Area Topics Extraction

As stated in section 2, our problem is formulated as an optimization problem:

$$T^* = \arg\max_{T \subseteq C,\ |T| = k} Q(T, a)$$

where $Q(T, a)$ is a function that denotes how well a set of topics $T$ represents all topics in $C$ on the given area $a$.


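We first prove the problem is NP-hard by reducing the Dominating Set Problem [Karp1972, Gary and Johnson1979] to this problem as follows.


Proof. For $c_i, c_j \in C$, we first define the relativeness between $c_i$ and $c_j$ as $rel(c_i, c_j)$, and if $rel(c_i, c_j) > 0$, we assign an undirected edge between $c_i$ and $c_j$; otherwise, there is no edge between $c_i$ and $c_j$. Thus we get an undirected graph $G = (C, E)$ of all concepts in $C$, where $E$ is the set of all edges in $G$.

Then we define $q(c, T)$ as: $q(c, T) = 1$ if $\exists t \in T$ such that $t = c$ or $(c, t) \in E$; $q(c, T) = 0$ otherwise. And then we define $Q(T, a)$ as:

$$Q(T, a) = \sum_{c \in C} q(c, T)$$

We then show that if we can find the maximum value $Q^* = \max_{T \subseteq C,\ |T| = k} Q(T, a)$, we can also decide, for the given number $k$, whether there exists a dominating set $D$ where $D \subseteq C$ and such that $|D| \leq k$. The reduction process is as follows: we compare $Q^*$ with $|C|$, which is the number of concepts in $C$, and according to our definition of $q$ and $Q$, it must hold that $Q^* \leq |C|$. If $Q^* = |C|$, then $\exists T$ with $|T| = k$ such that $\forall c \in C$, $q(c, T) = 1$, which means there exists a dominating set $D = T$ such that $|D| \leq k$; if $Q^* < |C|$, then $\forall T$ with $|T| = k$, $\exists c \in C$ such that $q(c, T) = 0$, which means there does not exist a dominating set $D$ such that $|D| \leq k$. ∎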
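The reduction can be illustrated on a toy graph: maximizing the coverage-style goal function over all subsets of size $k$ answers the dominating-set question. This is a brute-force sketch for illustration only; the function names and the example graph are assumptions, and the exhaustive search is exactly what makes the problem intractable at scale.

```python
from itertools import combinations

def coverage(selected, nodes, edges):
    """Goal function of the reduction: number of nodes in `selected` or adjacent to it."""
    adjacent = {n: {n} for n in nodes}
    for u, v in edges:
        adjacent[u].add(v)
        adjacent[v].add(u)
    chosen = set(selected)
    return sum(1 for n in nodes if adjacent[n] & chosen)

def has_dominating_set(nodes, edges, k):
    """Decide the dominating-set question by maximizing coverage over all k-subsets."""
    best = max(coverage(t, nodes, edges) for t in combinations(nodes, k))
    return best == len(nodes)

# A path graph a - b - c - d - e.
nodes = ["a", "b", "c", "d", "e"]
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")]

print(has_dominating_set(nodes, edges, 1))  # no single node dominates the path
print(has_dominating_set(nodes, edges, 2))  # {b, d} dominates every node
```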

Heuristic Algorithm.

Since the problem is NP-hard, we propose an approximate heuristic algorithm in our model to solve it, as outlined in Algorithm 1 and detailed as follows. The main idea is that we select topics one by one, and in the $i$-th step, we select topic $t_i$ such that

$$t_i = \arg\max_{c \in C \setminus T_{i-1}} Q(T_{i-1} \cup \{c\}, a)$$

where $T_{i-1}$ is the selected topics set before the $i$-th step. To calculate $Q(T, a)$, we introduce the general weight $g(c, a)$ to measure the importance of topic $c$ in the given area $a$. We call $g(c, a)$ the general weight because its value is set using domain knowledge, may not be very precise, and measures the importance of $c$ in area $a$ only to a general extent. We will describe the calculation of $g(c, a)$ in the following part. Then we define $Q(T, a)$ as:

$$Q(T, a) = \sum_{c \in C} g(c, a) \cdot q(c, T)$$

and define $q(c, T)$ as:

$$q(c, T) = \max_{t \in T} rel(c, t), \qquad q(c, \emptyset) = 0$$

where $rel(c, t)$ represents the relativeness between $c$ and $t$.

After we get the embeddings of topics in section 3.1, we can calculate $rel(c, t)$ as follows:

$$rel(c, t) = \frac{v_c \cdot v_t}{\lVert v_c \rVert \, \lVert v_t \rVert}$$

where $v_c$ and $v_t$ are the embeddings of $c$ and $t$ respectively.
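The greedy selection can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the goal function is a weighted sum of each topic's best cosine similarity to the selected set, clips negative similarities at zero so the objective stays monotone, and uses made-up two-dimensional embeddings and weights.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is the zero vector)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def greedy_top_k(embeddings, weights, k):
    """Pick k topics one by one, each time maximizing the weighted coverage."""
    topics = list(embeddings)
    selected = []
    # best[c] caches the highest similarity between c and any selected topic.
    best = {c: 0.0 for c in topics}
    while len(selected) < min(k, len(topics)):
        top_score, top_topic = -1.0, None
        for cand in topics:
            if cand in selected:
                continue
            score = sum(
                weights[c] * max(best[c], cosine(embeddings[c], embeddings[cand]))
                for c in topics
            )
            if score > top_score:
                top_score, top_topic = score, cand
        selected.append(top_topic)
        for c in topics:
            best[c] = max(best[c], cosine(embeddings[c], embeddings[top_topic]))
    return selected

# Hypothetical toy data: two loose clusters of topics.
embeddings = {
    "machine_learning": (1.0, 0.0),
    "deep_learning": (0.8, 0.6),
    "logic": (0.0, 1.0),
    "prolog": (-0.6, 0.8),
}
weights = {"machine_learning": 1.0, "deep_learning": 0.5, "logic": 0.8, "prolog": 0.2}

print(greedy_top_k(embeddings, weights, 2))
```

In this toy instance the first pick is the topic that sits between the two clusters, because it covers the most weighted mass on its own; the second pick then fills in the cluster that is still poorly covered.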

General Weight Calculation.

To calculate the general weight $g(c, a)$ of topic $c$ in the given area $a$, we incorporate domain knowledge from an external large-scale knowledge base into our model. This shares a similar idea with Distant Supervision [Mintz et al.2009]. We again use Wikipedia as our knowledge base, and use the category information of the given area as the domain knowledge to help calculate $g(c, a)$. The idea behind the general weight is that topics at shallower depths in the subcategories of $a$ are probably more important in area $a$. More specifically, we calculate $g(c, a)$ in the following steps:

  • Find the category that represents $a$ in Wikipedia, which is also denoted as $a$.

  • For the given area $a$, recursively extract all its subcategories $S_a = \{s_0, s_{11}, s_{12}, \dots, s_{ij}, \dots\}$ from Wikipedia, where $s_0$ is the root category and $s_{ij}$ represents the $j$-th subcategory at depth $i$.

  • Calculate the general weight of topic $c$ as $g(c, a) = h(d_c)$, where $d_c$ is the depth of topic $c$ in $a$’s subcategories if $c \in S_a$; otherwise $g(c, a) = 0$ (or equivalently set $d_c = \infty$ if we want to put all topics in $S_a$). $h(d)$ is a monotonically decreasing function of $d$, and can be selected empirically.
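The steps above can be sketched with a breadth-first traversal of the category graph. The category data, the function names, and the choice $h(d) = 1/(1+d)$ are all assumptions made for illustration; the text only requires $h$ to be monotonically decreasing. Breadth-first search is used so that each topic gets its shallowest depth even when the category graph contains cycles.

```python
from collections import deque

def category_depths(root, subcategories, max_depth=None):
    """Return {topic: shallowest depth} for everything reachable from `root`."""
    depths = {root: 0}
    queue = deque([root])
    while queue:
        category = queue.popleft()
        if max_depth is not None and depths[category] >= max_depth:
            continue  # do not expand below the depth limit
        for child in subcategories.get(category, ()):
            if child not in depths:  # first visit is the shallowest one
                depths[child] = depths[category] + 1
                queue.append(child)
    return depths

def general_weight(topic, depths, h=lambda d: 1.0 / (1 + d)):
    """g(c, a) = h(depth) inside the area's subcategories, 0 outside."""
    return h(depths[topic]) if topic in depths else 0.0

# Hypothetical category graph with a cycle (machine_learning -> artificial_intelligence).
subcategories = {
    "artificial_intelligence": ["machine_learning", "natural_language_processing"],
    "machine_learning": ["deep_learning", "artificial_intelligence"],
    "deep_learning": [],
}
depths = category_depths("artificial_intelligence", subcategories)

print(depths)
print(general_weight("deep_learning", depths))
print(general_weight("cooking", depths))
```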

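Input: Area $a$, high-quality candidate topics set $C_h$, contributive topics set $C_c$, the number $k$ of topics to extract, general weights $g(c, a)$ of topics in $C_c$.
Output: The top-$k$ topics set $T$.
$T \leftarrow \emptyset$;
while $|T| < k$ do
       $max \leftarrow -\infty$;
       foreach $c \in C_h \setminus T$ do
             $sum \leftarrow 0$;
             foreach $c' \in C_c$ do
                   $sum$ += $g(c', a) \cdot q(c', T \cup \{c\})$;
             if $sum > max$ then
                   $max \leftarrow sum$; $t \leftarrow c$;
       $T \leftarrow T \cup \{t\}$;
return $T$;
Algorithm 2 Fast Top-$k$ Area Topics Extraction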

3.3 Algorithmic Analysis

We argue that Algorithm 1 achieves at least a $(1 - 1/e)$-approximation of the original NP-hard problem. We first prove that the goal function of the original optimization problem is non-negative, monotonically non-decreasing, and submodular, and then we use these properties to prove its error bound. By definition the goal function is non-negative and monotonically non-decreasing; thus we only show its submodularity as follows.


As stated before, the problem is formulated as follows:

$$T^* = \arg\max_{T \subseteq C,\ |T| = k} Q(T, a)$$

where $Q(T, a)$ is the goal function, which represents how well the topics set $T$ can represent $C$ in the given area $a$. For a given topic $c \notin T$, we first denote $\Delta_1 = Q(T \cup \{c\}, a) - Q(T, a)$, which is the increment to the goal function from adding $c$ to $T$. Then we add a topic $c'$ with $c' \neq c$ to $T$, and denote $\Delta_2 = Q(T \cup \{c'\} \cup \{c\}, a) - Q(T \cup \{c'\}, a)$. By the property of $Q$ assumed in section 2, we have $\Delta_1 \geq \Delta_2$, which means the goal function is submodular.

Since the goal function of our problem is monotonically non-decreasing, non-negative, and submodular, the solution generated by Algorithm 1 is at least $(1 - 1/e)$ of the optimal solution [Nemhauser, Wolsey, and Fisher1978, Kempe, Kleinberg, and Tardos2003].
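The bound can be checked numerically on a small random instance by comparing the greedy value with the brute-force optimum. This sketch assumes a goal function of the weighted best-similarity form with non-negative similarities; the random similarity matrix and weights are synthetic, and a single instance only illustrates the guarantee rather than proving it.

```python
import math
import random
from itertools import combinations

def objective(selected, sim, weights):
    """Weighted sum of each topic's best similarity to the selected set."""
    if not selected:
        return 0.0
    return sum(w * max(sim[c][t] for t in selected) for c, w in enumerate(weights))

def greedy(sim, weights, k):
    """Add, k times, the topic that maximizes the objective of the enlarged set."""
    selected = []
    for _ in range(k):
        rest = [c for c in range(len(weights)) if c not in selected]
        selected.append(max(rest, key=lambda c: objective(selected + [c], sim, weights)))
    return selected

def brute_force(sim, weights, k):
    """Exhaustively search all k-subsets (feasible only for tiny instances)."""
    return max(combinations(range(len(weights)), k),
               key=lambda t: objective(t, sim, weights))

random.seed(0)
n, k = 8, 3
sim = [[0.0] * n for _ in range(n)]
for i in range(n):
    sim[i][i] = 1.0
    for j in range(i + 1, n):
        sim[i][j] = sim[j][i] = random.random()
weights = [random.random() for _ in range(n)]

greedy_value = objective(greedy(sim, weights, k), sim, weights)
optimal_value = objective(brute_force(sim, weights, k), sim, weights)
ratio = greedy_value / optimal_value
print(ratio >= 1 - 1 / math.e)
```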

| Dataset | Method | AI Precision@15 | AI MAP | CV Precision@15 | CV MAP | ML Precision@15 | ML MAP | NLP Precision@15 | NLP MAP | SE Precision@15 | SE MAP |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ACM CCS | TFIDF | 0.1333 | 0.0144 | 0.0000 | 0.0000 | 0.3333 | 0.1560 | - | - | 0.2666 | 0.1286 |
| ACM CCS | LDA | 0.2667 | 0.1696 | 0.0000 | 0.0000 | 0.2667 | 0.1020 | - | - | 0.2000 | 0.1032 |
| ACM CCS | TextRank | 0.4000 | 0.2556 | 0.0000 | 0.0000 | 0.3333 | 0.1308 | - | - | 0.3333 | 0.1830 |
| ACM CCS | FastKATE-1 | 0.4000 | 0.1551 | 0.0667 | 0.0056 | 0.3333 | 0.1183 | - | - | 0.4667 | 0.4137 |
| ACM CCS | FastKATE-2 | 0.4000 | 0.1797 | 0.1333 | 0.0231 | 0.3333 | 0.1896 | - | - | 0.5333 | 0.4994 |
| Microsoft FoS | TFIDF | 0.1333 | 0.0333 | 0.0667 | 0.0222 | 0.4667 | 0.2864 | 0.1333 | 0.0933 | 0.0667 | 0.0167 |
| Microsoft FoS | LDA | 0.2667 | 0.2130 | 0.2667 | 0.0989 | 0.4000 | 0.2901 | 0.1333 | 0.0889 | 0.2000 | 0.0375 |
| Microsoft FoS | TextRank | 0.4000 | 0.2600 | 0.2667 | 0.1302 | 0.4667 | 0.3529 | 0.1333 | 0.1000 | 0.3333 | 0.1077 |
| Microsoft FoS | FastKATE-1 | 0.4000 | 0.4606 | 0.2667 | 0.2056 | 0.5333 | 0.3417 | 0.2000 | 0.1444 | 0.4000 | 0.1775 |
| Microsoft FoS | FastKATE-2 | 0.4667 | 0.2193 | 0.3333 | 0.2405 | 0.5333 | 0.3522 | 0.2000 | 0.1511 | 0.4667 | 0.2658 |
| Domain Experts | TFIDF | 0.1333 | 0.0194 | 0.2000 | 0.0556 | 0.6667 | 0.4360 | 0.3333 | 0.2321 | 0.2000 | 0.0952 |
| Domain Experts | LDA | 0.3333 | 0.2130 | 0.4000 | 0.1838 | 0.6000 | 0.4979 | 0.4000 | 0.2706 | 0.2667 | 0.1261 |
| Domain Experts | TextRank | 0.4667 | 0.3750 | 0.4000 | 0.2183 | 0.7333 | 0.5666 | 0.4000 | 0.2853 | 0.4000 | 0.2189 |
| Domain Experts | FastKATE-1 | 0.6000 | 0.4204 | 0.4667 | 0.3029 | 0.6667 | 0.5202 | 0.5333 | 0.3831 | 0.5333 | 0.4492 |
| Domain Experts | FastKATE-2 | 0.7333 | 0.6097 | 0.6000 | 0.4389 | 0.8000 | 0.6710 | 0.5333 | 0.4020 | 0.6000 | 0.5394 |

Table 1: Performance of different methods in our experiment. Because there are only 8 nodes in the sub-tree of NLP in the ACM CCS dataset, we leave those cells empty. SE here corresponds to "Software and its engineering" in the ACM CCS dataset.

3.4 Fast Implementation

The time complexity of Algorithm 1 is $O(kn^2)$, where $k$ is the number of topics to be extracted and $n = |C|$ is the number of elements in $C$. In practice, $k \ll n$, but $n$ may be on the order of (tens of) millions (we extract 9,355,550 candidate topics from Wikipedia, as mentioned above). Algorithm 1 is therefore infeasible in practice and takes prohibitively long to return results (which is the case in our experiments). However, we observe the following two facts:

  • Most of the candidate topics in the whole set $C$ are not relevant to a given area.

  • When the general weight $g(c, a)$ of a topic $c$ is small enough, this topic’s contribution to the whole sum ($sum$ in Algorithm 1) is small too.

From these two observations, we derive two strategies that greatly speed up our algorithm:

  • We keep only topics within a depth $d_h$ in the given area’s category as high-quality candidate topics $C_h$ from the original set $C$.

  • Since the general weight function $h(d)$ of a topic is monotonically decreasing in the topic’s depth $d$, we can choose a depth $d_c$ (with a well-defined $h$) such that the contributions of all topics below this depth are small enough to be discarded without calculation.

This leads to a much faster algorithm with time complexity $O(k|C_h||C_c|)$, as summarized in Algorithm 2, where $C_h$ and $C_c$ represent the high-quality candidate topics set and the contributive topics set respectively, and in practice we have $|C_h| \ll n$ and $|C_c| \ll n$.
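The restricted search can be sketched as below. This is an illustrative sketch rather than the authors' code: the similarity lookup table, topic names, and weights are hypothetical, and the cache of each contributive topic's best similarity is an implementation choice that keeps every step at one pass over candidates times contributive topics.

```python
def fast_greedy(candidates, contributive, weights, sim, k):
    """Greedy selection restricted to a candidate set and a contributive set.

    `sim(a, b)` returns a non-negative similarity; `weights` maps each
    contributive topic to its general weight.
    """
    best = {c: 0.0 for c in contributive}  # best similarity to the selected set so far
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def gain(cand):
            # Marginal increase of the goal function if `cand` were added.
            return sum(weights[c] * max(0.0, sim(c, cand) - best[c]) for c in contributive)
        pick = max(remaining, key=gain)
        selected.append(pick)
        remaining.remove(pick)
        for c in contributive:
            best[c] = max(best[c], sim(c, pick))
    return selected

# Hypothetical pairwise similarities; unlisted pairs default to 0.
table = {("ml", "dl"): 0.8, ("dl", "logic"): 0.6}

def sim(a, b):
    if a == b:
        return 1.0
    return table.get((a, b), table.get((b, a), 0.0))

topics = ["ml", "dl", "logic"]
weights = {"ml": 1.0, "dl": 0.5, "logic": 0.8}
print(fast_greedy(topics, topics, weights, sim, 2))
```

Because the marginal gain of a candidate is computed against the cached best similarities, shrinking either set reduces the work per step proportionally.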

4 Experimental Results

We train our model on one of the largest public knowledge bases (Wikipedia). As there are no standard datasets with ground truth, and such ground truth is difficult to create, for evaluation purposes we collect three real-world datasets and choose five representative areas in computer science: Artificial Intelligence (AI), Computer Vision (CV), Machine Learning (ML), Natural Language Processing (NLP), and Software Engineering (SE), to compare the performance of our model with several alternative methods. Our model is not restricted to these areas and can in principle be applied to any other area. The datasets and code are publicly available, and a demo is ready. The demo's inputs are: (1) $a$: the area name in the form of a topical phrase (words are connected by underscores, such as “artificial_intelligence”); (2) $k$: the number of topics to be extracted. Its outputs are: (1) the extracted topics in ranked order, accompanied by their scores (as defined in Section 3.2); (2) the running time.

4.1 Datasets

We download Wikipedia data from a Wikipedia dump as our knowledge base $KB$, use its (preprocessed) titles of all articles and categories as $C$, use the text of all articles as $X$, and use its category structure as $R$. After we preprocess the titles as stated in section 3.1, we get 9,355,550 candidate topics in $C$. As stated in section 3.1, we use the full text of Wikipedia to train topic embeddings, treating each topic as a single token in the Word2Vec model, and we use Gensim to implement our model. We explicitly set the vector size, window size, minimum count of each topic, downsampling threshold ($t$), minimum sentence length, and number of workers; for other parameters, we use the default settings in Gensim. The three real-world datasets collected for evaluation are detailed as follows.

ACM CCS classification tree. The ACM CCS classification tree is a poly-hierarchical ontology that contains 2,126 nodes in total. In this tree (actually a directed acyclic graph), each node can be viewed as a topic and each non-leaf node has several child nodes as its sub-topics. Although different nodes may have different numbers and granularities of nodes in their sub-trees, the tree still provides guidance on what the top topics in a given area may be.

Microsoft Fields of Study (FoS). Microsoft Fields of Study (FoS), from the Microsoft Academic Graph (MAG), is a directed acyclic graph in which each node also represents a topic; it contains 49,038 nodes in total. Each node in the graph is accompanied by a “level” representing its depth/granularity in the graph. The network has 4 different levels in total. Each node has super-nodes of different levels as its super-topics, and each super-topic is accompanied by a confidence value. The confidences of all super-nodes of the same level of one topic sum to $1$. This dataset also provides guidance on what the top topics in a given area may be.

Domain Experts Annotated Dataset. As there are no standard datasets/benchmarks that perfectly match our problem, we also let domain experts directly annotate the top-$k$ ($k = 15$) topics in the five given areas without giving them any single dataset for reference.

For the first two datasets, we let domain experts select the top-$k$ topics from each given area’s sub-topics to match our problem better. Because a given area may have too many nodes among its sub-topics, we instruct domain experts to first select a larger set of topics than needed and then perform a secondary screening. Since we need annotations for all three datasets, we set up the following criteria to reduce subjectivity in the annotation process and help domain experts reach an agreement:

  • Selected topics should be more significant than other topics in the given area.

  • Selected topics should cover the whole given area as far as possible. This implies that they should not be too similar to each other in the given area; for example, Artificial Neural Networks and Neural Networks should be viewed as the same topic in the AI area.

After we get the results of each domain expert, we count how many times each topic was selected, rank the topics by their counts, and choose the top-$k$ as the ground truth of the given area. We empirically set $k = 15$ in our experiment.

4.2 Evaluation Metrics

To quantitatively evaluate the proposed model, we consider the following two metrics.


Precision@$k$.

Since the number of extracted results is set to be the same for domain experts and machines, we use Precision@$k$ to measure the performance of different methods. Since the order of the extracted results is also important, we introduce another metric as follows.

Mean Average Precision (MAP).

For a single result (such as a ranked list in our experiments), AP is defined as follows:

$$AP = \frac{\sum_{i=1}^{n} p_i}{m}$$

where $m$ is the number of all correct items (i.e., the length of the human-annotated ranked list); $n$ is the length of the machine-extracted ranked list (which is the same as $m$ in our experiments); and $p_i$ equals 0 when the $i$-th item is incorrect, and otherwise equals the precision of the first $i$ items in the ranked list. MAP is then calculated by averaging the APs over all results.
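Both metrics are short enough to write out directly. The sketch below follows the definitions above; the function names and the example lists are illustrative assumptions.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the first k extracted items that appear in the ground truth."""
    return sum(1 for item in ranked[:k] if item in relevant) / k

def average_precision(ranked, relevant):
    """Sum of precision values at each correct position, divided by the number of correct items."""
    hits, total = 0, 0.0
    for i, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / i  # precision of the first i items
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(results):
    """Average AP over (ranked list, ground truth) pairs."""
    return sum(average_precision(r, g) for r, g in results) / len(results)

ranked = ["a", "x", "b", "y"]
relevant = {"a", "b", "c", "d"}
print(precision_at_k(ranked, relevant, 4))
print(average_precision(ranked, relevant))
```

With this definition, a list whose correct items all appear at the top reaches an AP equal to its precision, and pushing correct items further down lowers the AP.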

| TFIDF | LDA | TextRank | FastKATE-2 |
|---|---|---|---|
| Social Medium | Robotics | Robotics | Machine Learning |
| User Interface | Virtual Reality | Machine Learning | Computational Linguistics |
| Virtual Reality | Machine Learning | Semantic Web | Knowledge Representation |
| Classification System | Semantic Web | Natural Language Processing | Artificial Life |
| Facial Expression | Speech Recognition | Image Processing | Ubiquitous Computing |
| World Wide Web | Turing Test | Virtual Reality | Computer Vision |
| Remote Sensing | Usability | Speech Recognition | Natural Language Processing |
| Machine Learning | User Interface | Artificial Neural Network | Multi Agent System |
| Graphical User Interface | Natural Language Processing | User Interface | Robotics |
| Image Processing | Knowledge Base | Usability | Expert System |
| Unmanned Aerial Vehicle | World Wide Web | Knowledge Base | Logic Programming |
| Computer Vision | Image Processing | Knowledge Representation | Deep Learning |
| Knowledge Base | Optical Character Recognition | World Wide Web | Fuzzy Logic |
| Aerial Photography | Handwriting Recognition | Logic Programming | Artificial Neural Network |
| Speech Recognition | Artificial Neural Network | Fuzzy Logic | Computational Mathematics |

Table 2: Extracted topics ($k = 15$) in AI area using different methods. Bold ones represent relatively correct ones.

4.3 Comparison Methods

For each given area $a$, we first extract all its subcategories within a depth $d_h$, and use them as candidate topics $C_h$. We then extract all articles of these candidate topics from $X$, denoted $X_a$, for the LDA and TextRank methods, where $X$ represents the Wikipedia corpus as introduced in section 2.

  • Topic TF-IDF (TFIDF): We calculate each candidate topic’s tf-idf [Jones1973] value in the whole Wikipedia corpus (viewing each article as a document), and rank all candidate topics by these values.

  • LDA: We train an LDA [Blei, Ng, and Jordan2003] model on all documents in $X_a$. For each candidate topic $c$ (note that this is a topical phrase, not one of the topics extracted by LDA, which are multinomial distributions over words), we calculate its weight as follows:

    $$w(c) = \sum_{j=1}^{|X_a|} \sum_{i=1}^{Z} \theta_{j,i} \, \phi_{i,c}$$

    where $Z$ is the number of topics extracted by the LDA model, $\theta_{j,i}$ is the probability of the $i$-th topic of the LDA model in the $j$-th article, and $\phi_{i,c}$ is the probability of $c$ in the $i$-th topic of the LDA model. When training, we remove documents that contain too few words. We use Gensim to implement this model, with default values for all parameters except those we set explicitly.

  • TextRank: We run the TextRank [Mihalcea and Tarau2004] algorithm on each article in $X_a$, and for each candidate topic $c$ (in the form of a topical phrase), we calculate its weight as follows:

    $$w(c) = \sum_{j=1}^{|X_a|} TR_j(c)$$

    where $TR_j(c)$ is the weight generated by TextRank for $c$ in the $j$-th article of $X_a$.

  • FastKATE: This is our model outlined in Algorithm 2. Because of the prohibitive running time of Algorithm 1, we consider it impractical and do not compare its results with the others. The general weight function $h(d)$ is selected empirically. We use two different depth settings, denoted FastKATE-1 and FastKATE-2, to compare their performance and time costs; FastKATE-2 uses the smaller contributive topics set.

| Area | AI | CV | ML | NLP | SE |
|---|---|---|---|---|---|
| FastKATE-1 | 34.217s | 4.533s | 0.262s | 0.101s | 36.675s |
| FastKATE-2 | 0.593s | 0.122s | 0.025s | 0.009s | 0.093s |

Table 3: Average running time of our two models over 100 runs. We can see that FastKATE-2 can return results in real time.

4.4 Results and Analysis

Accuracy Performance.

Table 1 lists the performance of different methods on the problem of extracting the top-$k$ topics in a given area. In terms of Precision@$k$, our model FastKATE-2 is the best or tied for the best on all three datasets and in all five areas. In terms of MAP, FastKATE-2 performs the best in 11 of the 14 cases. This suggests our model not only extracts more correct top-$k$ topics but also ranks them in a more accurate order. We can also see that FastKATE-1 (which differs from FastKATE-2 only in parameter settings) performs second best in most cases, which suggests that our model remains effective compared with other methods under different parameter settings, and is therefore robust.

We note that the average performance of all methods on the first two datasets (ACM CCS and Microsoft FoS) is worse than on the third dataset, which is annotated by domain experts specifically for this problem. This is understandable, since no existing dataset/benchmark perfectly matches this problem: although the first two datasets are highly related to the problem, they are not designed for this purpose. This is why we annotate our own dataset directly with the help of domain experts, and we think the third dataset better reflects the performance of different methods on this problem.

It is beyond our expectation that FastKATE-2 performs better than FastKATE-1 in most cases, because FastKATE-2 uses a smaller contributive topics set than FastKATE-1 and thus seems to access less information. One possible reason is that the contributive topics set becomes noisier as it goes deeper, and when we restrict the depth to only one, we have cleaner data and thus may get better results. Besides, as stated in section 3.4, when the depth is smaller, our algorithm runs faster. We record the average running time of our two models over 100 runs on all five areas in Table 3. We can see that FastKATE-2 is faster than FastKATE-1 and can return results in real time.

4.5 Case Study

Table 2 lists the topics extracted in the AI area using TFIDF, LDA, TextRank and FastKATE-2 respectively. We can see that most topics extracted by our model are of high quality and are more convincing than those extracted by the other methods.

5 Related Work

Our work is mainly related to work on the following three aspects: topic modeling, automatic keyphrase extraction, and word/phrase embedding. (1) Topic Modeling. Topic modeling has been widely used to extract topics from large-scale scientific literature [Blei, Ng, and Jordan2003, Griffiths and Steyvers2004, Steyvers and Griffiths2007]. Topics in these models are usually in the form of multinomial distributions over words, which makes it hard for researchers to identify which specific topics these distributions stand for [Mei, Shen, and Zhai2007]. To address this challenge, much work has been conducted to find automatic or semi-automatic ways to label these topic models [Mei, Shen, and Zhai2007, Ramage et al.2009, Lau et al.2011], which alleviates this problem to some extent. (2) Automatic Keyphrase Extraction. There are two main approaches to extracting keyphrases: supervised and unsupervised. In supervised methods, the keyphrase extraction problem is usually recast as a classification problem [Witten et al.1999, Turney2000] or as a ranking problem [Jiang, Hu, and Li2009]. Existing unsupervised approaches to keyphrase extraction can be categorized into four groups [Hasan and Ng2014]: graph-based ranking [Mihalcea and Tarau2004], topic-based clustering [Liu et al.2009], simultaneous learning [Wan, Yang, and Xiao2007], and language modeling [Tomokiyo and Hurst2003]. (3) Word/Phrase Embedding. Feature learning has been extensively studied by the machine learning community under various headings. In the natural language processing (NLP) area, feature learning of words/phrases is usually referred to as word/phrase embedding, which means embedding words/phrases into a latent feature space [Roweis and Saul2000, Mikolov et al.2013]. This method can help calculate relations/similarities between words/phrases. In our work, we embed topics into a latent feature space, which is similar to this line of work.

6 Conclusion

In this paper, we formally formulate the problem of top-$k$ area topics extraction. We propose FastKATE, in which topics have both explicit and latent representations. We leverage a large-scale knowledge base (Wikipedia) to learn topic embeddings and use these representations to help capture the representativeness of topics for given areas. We develop a heuristic algorithm together with a fast implementation to efficiently solve the problem, and prove that its solution is at least $(1 - 1/e)$ of the optimal solution. Experiments on three real-world datasets and in five different areas validate our model’s effectiveness, robustness, real-timeness (returning results in <1s), and its superiority over other methods. In the future, we plan to integrate more knowledge bases and to apply our model to a broader range of problems.

References

References

  • [Blei, Ng, and Jordan2003] Blei, D. M.; Ng, A. Y.; and Jordan, M. I. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research 3(Jan):993–1022.
  • [Gary and Johnson1979] Garey, M. R., and Johnson, D. S. 1979. Computers and intractability: A guide to the theory of NP-completeness.
  • [Griffiths and Steyvers2004] Griffiths, T. L., and Steyvers, M. 2004. Finding scientific topics. Proceedings of the National Academy of Sciences 101(suppl 1):5228–5235.
  • [Grineva, Grinev, and Lizorkin2009] Grineva, M.; Grinev, M.; and Lizorkin, D. 2009. Extracting key terms from noisy and multitheme documents. In Proceedings of the 18th international conference on World wide web, 661–670. ACM.
  • [Hasan and Ng2014] Hasan, K. S., and Ng, V. 2014. Automatic keyphrase extraction: A survey of the state of the art. In ACL (1), 1262–1273.
  • [Jiang, Hu, and Li2009] Jiang, X.; Hu, Y.; and Li, H. 2009. A ranking approach to keyphrase extraction. In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, 756–757. ACM.
  • [Jones1973] Jones, K. S. 1973. Index term weighting. Information storage and retrieval 9(11):619–633.
  • [Karp1972] Karp, R. M. 1972. Reducibility among combinatorial problems. In Complexity of computer computations. Springer. 85–103.
  • [Kempe, Kleinberg, and Tardos2003] Kempe, D.; Kleinberg, J.; and Tardos, É. 2003. Maximizing the spread of influence through a social network. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, 137–146. ACM.
  • [Lau et al.2011] Lau, J. H.; Grieser, K.; Newman, D.; and Baldwin, T. 2011. Automatic labelling of topic models. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1, 1536–1545. Association for Computational Linguistics.
  • [Liu et al.2009] Liu, Z.; Li, P.; Zheng, Y.; and Sun, M. 2009. Clustering to find exemplar terms for keyphrase extraction. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1, 257–266. Association for Computational Linguistics.
  • [McGuinness, Van Harmelen, and others2004] McGuinness, D. L.; Van Harmelen, F.; et al. 2004. OWL Web Ontology Language overview. W3C recommendation 10(10):2004.
  • [Mei, Shen, and Zhai2007] Mei, Q.; Shen, X.; and Zhai, C. 2007. Automatic labeling of multinomial topic models. In Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, 490–499. ACM.
  • [Mihalcea and Tarau2004] Mihalcea, R., and Tarau, P. 2004. TextRank: Bringing order into text. In EMNLP, volume 4, 404–411.
  • [Mikolov et al.2013] Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. S.; and Dean, J. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, 3111–3119.
  • [Mintz et al.2009] Mintz, M.; Bills, S.; Snow, R.; and Jurafsky, D. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2, 1003–1011. Association for Computational Linguistics.
  • [Nemhauser, Wolsey, and Fisher1978] Nemhauser, G. L.; Wolsey, L. A.; and Fisher, M. L. 1978. An analysis of approximations for maximizing submodular set functions—i. Mathematical Programming 14(1):265–294.
  • [Ramage et al.2009] Ramage, D.; Hall, D.; Nallapati, R.; and Manning, C. D. 2009. Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1, 248–256. Association for Computational Linguistics.
  • [Roweis and Saul2000] Roweis, S. T., and Saul, L. K. 2000. Nonlinear dimensionality reduction by locally linear embedding. Science 290(5500):2323–2326.
  • [Salton and Buckley1997] Salton, G., and Buckley, C. 1997. Term-weighting approaches in automatic text retrieval. Morgan Kaufmann Publishers Inc.
  • [Steyvers and Griffiths2007] Steyvers, M., and Griffiths, T. 2007. Probabilistic topic models. Handbook of latent semantic analysis 427(7):424–440.
  • [Svitkina and Fleischer2011] Svitkina, Z., and Fleischer, L. 2011. Submodular approximation: Sampling-based algorithms and lower bounds. SIAM Journal on Computing 40(6):1715–1737.
  • [Tang et al.2015] Tang, J.; Zhang, C.; Cai, K.; Zhang, L.; and Su, Z. 2015. Sampling representative users from large social networks. In AAAI, 304–310. Citeseer.
  • [Tomokiyo and Hurst2003] Tomokiyo, T., and Hurst, M. 2003. A language model approach to keyphrase extraction. In Proceedings of the ACL 2003 workshop on Multiword expressions: analysis, acquisition and treatment, Volume 18, 33–40. Association for Computational Linguistics.
  • [Turney2000] Turney, P. D. 2000. Learning algorithms for keyphrase extraction. Information retrieval 2(4):303–336.
  • [Wan, Yang, and Xiao2007] Wan, X.; Yang, J.; and Xiao, J. 2007. Towards an iterative reinforcement approach for simultaneous document summarization and keyword extraction. In ACL, volume 7, 552–559.
  • [Witten et al.1999] Witten, I. H.; Paynter, G. W.; Frank, E.; Gutwin, C.; and Nevill-Manning, C. G. 1999. KEA: Practical automatic keyphrase extraction. In Proceedings of the fourth ACM conference on Digital libraries, 254–255. ACM.