Less is More: Learning Prominent and Diverse Topics for Data Summarization

11/29/2016 ∙ by Jian Tang, et al. ∙ University of Michigan

Statistical topic models efficiently facilitate the exploration of large-scale data sets. Many models have been developed and broadly used to summarize the semantic structure in news, science, social media, and digital humanities. However, a common and practical objective in data exploration tasks is not to enumerate all existing topics, but to quickly extract representative ones that broadly cover the content of the corpus, i.e., a few topics that serve as a good summary of the data. Most existing topic models fit exactly the number of topics a user specifies, which imposes an unnecessary burden on users who have limited prior knowledge. We instead propose new models that are able to learn fewer but more representative topics for the purpose of data summarization. We propose a reinforced random walk that allows prominent topics to absorb tokens from similar and smaller topics, thus enhancing the diversity among the top topics extracted. With this reinforced random walk embedded as a general process in classical topic models, we obtain diverse topic models that are able to extract the most prominent and diverse topics from data. The inference procedures of these diverse topic models remain as simple and efficient as those of the classical models. Experimental results demonstrate that the diverse topic models not only discover topics that better summarize the data, but also require minimal prior knowledge from the users.


1 Introduction

A huge amount of unstructured data is continuously generated from various online information sources at an unprecedented speed, a phenomenon usually referred to as "big data." Along with this trend, a major challenge has emerged for data scientists: how to facilitate the understanding and exploration of such big data.

Statistical topic models [4], e.g., probabilistic latent semantic analysis (PLSA) [9] and latent Dirichlet allocation (LDA) [5], have been widely recognized as effective tools that assist users in understanding and exploring data. This family of probabilistic models is designed to automatically infer the hidden themes (a.k.a. topics) that are salient in a data collection. The discovered topics can be further utilized in other data mining tasks such as classification [11, 25], clustering [10, 22], sentiment analysis [18, 13], and user modeling [3].

These models commonly rely on a strong assumption that the user knows the actual number of topics in the data. This assumption may hold for small and restricted data sets, but it becomes impractical when the data is big and the domain is open, which has limited the usefulness of topic models in practical data exploration tasks. Sophisticated treatments have to be applied to topic models in order to relax this assumption, leading to various nonparametric [24] and hierarchical [12] versions of the models that are much more complicated. While these treatments have successfully advanced the theoretical understanding of topic models, they have not completely solved the problem. In practice, the quality of topics extracted by nonparametric models is often compromised, and how to find the hierarchical structure of topics remains an open question.

Clearly, all the aforementioned challenges arise from the practice of trying to extract all the topics in the data. Is this, however, a necessary practice at all? We find that in many real-world scenarios, the user actually wants to explore a few of the most prominent topics, instead of enumerating all possible topics in the data. Indeed, when searching for news reports about an event, an investigator may still want to enumerate every aspect or detail of the event, but an ordinary Web user only needs to read a few articles that well summarize the major aspects of the event. When exploring the literature of a new field, a researcher may start by digesting several major trends of research instead of investigating every research topic in that field. That is, different from investigation tasks where the concern is to enumerate all the topics in a data set, in most data exploration tasks the concern is to find the top K topics that reasonably summarize, or cover, the entire data collection. This is analogous to text summarization, where a summary of limited length is created to cover the most important points of the original document(s), or to information retrieval, where a limited number of results are presented to represent the big picture of content relevant to an information need. Compared to the conventional applications of topic models, the number K in this practice does not rely on an impractical assumption about the user's prior knowledge, but rather depends on the budget or the personal need of the user (e.g., I have time to digest three research topics, or I want to know the two most representative perspectives in a debate). In practice, this number K can be much smaller than the actual (unknown) number of topics in the data.

As a good summary of the data collection, the K topics extracted should not only be meaningful and interpretable, but also cover as much content of the original collection as possible. This naturally requires the topics extracted to be the most prominent ones in the data collection, while at the same time they should cover diverse aspects of the collection. A straightforward way to generate such a summary is to simply fit exactly K topics through a classical topic model. However, fitting only a few topics to a big data collection is likely to under-fit the data, making the extracted topics less meaningful or interpretable.

An alternative approach may first fit a large number of topics to the data and then pick the most important ones, e.g., the ones with the largest sizes. These topics are likely to be meaningful to the user. However, learning too many topics runs the risk of over-fitting the data, making the extracted topics either too small or too similar to each other. As a result, even the top ones may only cover partial aspects of the collection. This challenge motivates us to investigate new ways to extract the most representative, meaningful, and yet non-redundant topics from the data.

In this paper, we propose a reinforced random walk on a "social network" of topics that allows prominent topics to absorb tokens from similar and smaller topics. Specifically, during the inference procedure, a topic network is built to model the correlations, or interconnections, among the topics. For each word token in the collection, after it is assigned to one of the topics, it is allowed to take a random walk in the topic network and possibly transit to other topics. The probability of transition from one topic to another is initialized based on the similarity between the topics and then reinforced by the size of the target topic, i.e., the number of tokens already assigned or transited to it. During this process, the tokens belonging to smaller topics are likely to transit to larger and similar topics. In other words, a larger topic will absorb resources from its local neighborhood (of similar but smaller topics). As the process continues, a few prominent topics stand out from their local neighborhoods and become the most representative ones in the topic network. At the same time, these topics tend to represent different neighborhoods in the network, which enhances the diversity among the top topics (see Figure 1).

(a) A network of topics. Nodes are topics extracted by a classical topic model; topics with a high similarity are connected through an edge.
(b) Network of topics after the reinforced random walk. A few prominent and non-redundant topics stand out and present a good coverage of the network.
Figure 1: Reinforced Random Walk on a “Social” Network of Topics.

We embed this reinforced random walk process into the classical topic models PLSA and LDA, and obtain two diverse topic models, DivPLSA (diverse PLSA) and DivLDA (diverse LDA). The learning procedures of the two models remain as simple and efficient as those of the classical models, but are able to learn the prominent and diverse topics pervading the data. Top topics extracted by the diverse topic models are not only meaningful but also serve as good summaries of the collection. Experimental results on four real-world datasets demonstrate the effectiveness of the two diverse topic models for data summarization, compared with classical topic models and alternative approaches.

Organization. The rest of this paper is organized as follows. Section 2 discusses related work. Section 3 formally defines the problem of topic modeling for data summarization. Section 4 introduces the reinforced random walk and the proposed diverse topic models. The effectiveness of the diverse topic models is empirically evaluated in Section 5, and the study is concluded in Section 6.

2 Related Work

Our work presents a good analogy to extractive text summarization [7], the goal of which is to construct a short summary of an individual document or multiple documents by extracting the most representative sentences in the document(s). Instead of summarizing a single document or a few documents with sentences, topic modeling is used to summarize a large document collection with the latent topics pervading the collection, in which the topics play the role of "sentences" in document summarization. Analogous to text summarization, the coverage of the extracted topics over the original data should be maximized and the redundancy among the extracted topics should be minimized. In other words, the extracted topics should be both prominent and diverse. The diversity of results has been recognized as important not only in text summarization [6], but also in many other applications including Web search engines [1], query suggestion [14], ranking in information networks [17], and recommender systems [27].

Many algorithms for enhancing the diversity of the results of a retrieval or mining system have been proposed. In [6], Carbonell et al. proposed the Maximal Marginal Relevance (MMR) criterion for document retrieval, which aims to reduce the redundancy among the retrieved documents while preserving their relevance to the query. The MMR algorithm greedily selects the document that has the largest marginal relevance to the query, i.e., the relevance of the document to the query penalized by its largest similarity to the already selected documents. In [26], a "soft" version of the MMR algorithm called Grasshopper, based on an absorbing random walk, was proposed. In [17], Mei et al. proposed a unified process, DivRank, to generate an entire ranked list of vertices in an information network by balancing prestige and diversity. The DivRank algorithm is built on top of the vertex-reinforced random walk, in which the transition probabilities between vertices are reinforced by the number of visits to the target vertex. Neighboring vertices compete for resources against each other, and the probability mass is finally distributed among a set of diverse vertices. Our proposed reinforced random walk shares a similar idea with the DivRank algorithm. Instead of being reinforced by the number of visits to each vertex (topic in our case), the transition probabilities of the random walk taken by the word tokens are reinforced by the sizes of the topics. In this way, the larger topics gradually absorb tokens from smaller, similar ones, so that we end up with the most prominent and diverse topics in the data.
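To make the MMR-style topic selection used later as a baseline (Section 5.3) concrete, the following sketch shows how a greedy MMR criterion could be adapted to pick K diverse topics from a larger trained set. It is an illustrative sketch, not the implementation used in the paper; the function names and the trade-off parameter lambda_mmr are assumptions.

import numpy as np

def cosine(a, b):
    # Cosine similarity between two word distributions.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def mmr_select_topics(phi, sizes, K, lambda_mmr=0.7):
    """Greedy MMR-style selection of K topics.

    phi:   (T, V) array, each row a topic's word distribution.
    sizes: (T,) array, proportion of tokens covered by each topic
           (plays the role of query relevance here).
    """
    selected = []
    candidates = set(range(phi.shape[0]))
    while candidates and len(selected) < K:
        best, best_score = None, -np.inf
        for t in candidates:
            # Redundancy: largest similarity to an already selected topic.
            redundancy = max((cosine(phi[t], phi[s]) for s in selected), default=0.0)
            score = lambda_mmr * sizes[t] - (1.0 - lambda_mmr) * redundancy
            if score > best_score:
                best, best_score = t, score
        selected.append(best)
        candidates.remove(best)
    return selected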

Another related direction is to learn diversified mixture components in the area of mixture models [21, 28]. In most existing mixture models, the mixture components are drawn i.i.d. from a prior distribution. When the data is over-fitted, multiple mixture components are used to model one underlying group in the data, which results in many redundant mixture components. In [21], Petralia et al. proposed a repulsive mixture prior which penalizes redundant components. Zou and Adams [28] replaced the i.i.d. prior for mixture components with a determinantal point process (DPP), which defines a probability measure over the entire set of component distributions. However, though these models are able to obtain more diverse mixture components than classical mixture models, multiple similar mixture components may still be used to model one group in the data. In our proposed diverse topic models, the larger mixture components are able to fully absorb the smaller duplicated ones, which means our models can reduce the number of unnecessary mixture components.

3 Problem Definition

Statistical topic models represent and summarize the data through the discovery of hidden topics pervading the data collection. Each topic is formally defined as follows:

Definition 1

A topic z is defined as a multinomial distribution over the words in a vocabulary \mathcal{V}, i.e., \phi_z = \{p(w \mid z)\}_{w \in \mathcal{V}} with \sum_{w \in \mathcal{V}} p(w \mid z) = 1. Without loss of generality, we assume there are T topics in total in a given data collection, where T is unknown.

In the practice of data exploration, instead of enumerating all the topics, a more realistic need is to extract a given number of representative or prominent topics in the data. Based on these representative topics, users can have a quick understanding of the semantic structure of the data collection. Therefore, it is desirable to be able to extract a list of topics ranked according to their importance in the data. Formally, we define the problem as follows:

Definition 2

Given a data collection \mathcal{D}, the problem of learning top-K topics aims to infer a list of K topics ranked according to their prominence in \mathcal{D}, where K is a number specified by the user.

Note that, different from conventional applications of topic modeling, the number K here is specified completely based on the user's need instead of prior knowledge of the data, and it can be substantially smaller than the true number of topics T, which is unknown. To extract the top-K representative topics in the data, a straightforward way is to apply a classical topic model (e.g., LDA) directly to fit exactly K topics to the data (i.e., set the number of topics in the model to K). In this case, one would expect the inferred topics to be the most representative topics in the data. However, fitting only a few topics to the data usually results in topics that are actually mixtures of multiple topics, whose coherence or interpretability is compromised.

As it is difficult for a user to find an appropriate number of topics, an alternative is to fit the data with a large number of topics, which usually yields semantically coherent topics. These topics can then be ranked by the proportions of the data they cover (e.g., the number of word tokens assigned to each topic). Although these topics are likely to be coherent and meaningful, a large number of topics may over-fit the data, returning many duplicated or smallish topics. As a result, even the top-ranked topics may have a small coverage of the original data, because they are either too small or redundant. To provide a good summary of the data, it is desirable that the selected topics are coherent and interpretable, and have a high coverage of the data. Formally, we have

Definition 3

Given a data collection \mathcal{D} and an integer K, the problem of topic summarization aims to find a set of K topics such that every topic in the set is coherent and the set as a whole covers as much information in \mathcal{D} as possible.

Apparently, neither of the simple approaches above is suitable for topic summarization. In the next section, we introduce a novel approach, diverse topic modeling, to infer prominent and diverse topics that better summarize the data.

4 Diverse Topic Modeling

In most existing topic models or mixture models, the topics or mixture components are drawn i.i.d. from a prior distribution. No restriction is placed upon the entire set of topics. This becomes problematic when the data is over-fitted, in which case each underlying theme of the original data may be represented by multiple similar topics or be split into topics of a smaller granularity.

To tackle this problem, the diversity, or redundancy, among topics must be considered. Recent work replaces the independence assumption with priors favoring diversified topics (e.g., [21, 28]), usually through a regularization term over the entire set of topics: a set that contains similar topics is penalized. Although these treatments can infer more diverse topics than the classical models, they still make the same assumption on the number of topics, and multiple topics are still used to represent each underlying theme.

A more reasonable way is to merge similar topics into bigger ones, so that the top-ranked topics become more prominent and also more diverse. In this paper we propose to let the prominent topics merge with or absorb the smaller and similar topics. Naturally, we need a process under which topics can communicate with each other and compete for word tokens. If a topic absorbs all the word tokens belonging to another topic, the latter topic is completely absorbed by the former. In the following, we introduce a reinforced random walk as such a process.

4.1 Reinforced Random Walk

To model how the topics compete for word tokens with each other, the relationships among the topics must be considered. Specifically, we introduce a "social" network of topics, or a topic network, which is an undirected graph G = (Z, E), where Z is the set of topics and there is an edge e_{ij} in E between each pair of topics (z_i, z_j). The weight w_{ij} of e_{ij} is defined as the cosine similarity between the word distributions \phi_i and \phi_j, i.e.,

w_{ij} = \cos(\phi_i, \phi_j) = \frac{\sum_{w} \phi_i(w)\,\phi_j(w)}{\sqrt{\sum_{w} \phi_i(w)^2}\,\sqrt{\sum_{w} \phi_j(w)^2}}.    (1)

We formulate the absorbing process among the topics as a random walk process by the word tokens on the topic network. Specifically, if one token belonging to one topic transits to another topic, we can say a token of this topic is absorbed by the other topic. If all the tokens belonging to one topic are absorbed by other topics, this topic dies out.

As we want the larger topics to gradually absorb the smaller and similar topics, the tokens belonging to the smaller topics should be more likely to transit to the larger ones. This can be achieved by reinforcing the transition probabilities among the topics with the sizes of the topics, i.e., the number of tokens belonging to each topic. Specifically, the transition probability from topic z_i to z_j is defined as

p(z_i \to z_j) = \frac{p_0(z_i \to z_j)\, n_j^{\gamma}}{Z_i},    (2)

where p_0(z_i \to z_j) is the "organic" transition probability among the topics, n_j is the size of topic z_j, \gamma is the parameter used to control the reinforcement intensity by the topic size, and the normalizing factor Z_i is calculated as

Z_i = \sum_{k} p_0(z_i \to z_k)\, n_k^{\gamma}.    (3)

The "organic" transition probabilities among the topics can be calculated based on the similarity between the topics. Meanwhile, it is also reasonable to assume that each token has some probability of staying on its current topic instead of transiting to its neighbors; in this case, tokens belonging to the large topics can still stay on the topic itself. Specifically, we define p_0 as follows:

p_0(z_i \to z_j) = 1 - \lambda  if  j = i,   and   p_0(z_i \to z_j) = \lambda \, \frac{w_{ij}}{\sum_{k \neq i} w_{ik}}  if  j \neq i,    (4)

where \lambda is the probability of transiting to neighboring topics.
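To make the walk concrete, the sketch below computes the reinforced transition matrix under the reading of Eqs. (1)-(4) given above: cosine-similarity edge weights, a stay probability 1 - lambda, and multiplicative reinforcement by the target topic's size raised to gamma. The exact functional form and the variable names are assumptions for illustration rather than the authors' released code.

import numpy as np

def organic_transitions(phi, lam=0.1):
    """Organic transition matrix P0 from cosine similarities (Eqs. (1) and (4)).

    phi: (T, V) array of topic-word distributions.
    lam: probability of moving to a neighboring topic (1 - lam = stay).
    """
    norm = phi / (np.linalg.norm(phi, axis=1, keepdims=True) + 1e-12)
    w = norm @ norm.T                      # pairwise cosine similarities, Eq. (1)
    np.fill_diagonal(w, 0.0)               # no self-edges in the topic network
    P0 = lam * w / (w.sum(axis=1, keepdims=True) + 1e-12)
    np.fill_diagonal(P0, 1.0 - lam)        # stay on the current topic, Eq. (4)
    return P0

def reinforced_transitions(P0, sizes, gamma=1.0):
    """Reinforce P0 by target-topic sizes and renormalize (Eqs. (2)-(3))."""
    boost = np.power(np.asarray(sizes, dtype=float), gamma)
    P = P0 * boost[np.newaxis, :]          # multiply each column by n_j ** gamma
    return P / (P.sum(axis=1, keepdims=True) + 1e-12)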

The above random walk process is related to the vertex-reinforced random walk [20] and DivRank [17], in both of which the transition probabilities among the vertices of a network are reinforced by the number of visits to the vertices. In these processes, neighboring vertices compete for resources with each other. The reinforcement leads to a rich-get-richer phenomenon, and the resources are finally distributed across the prominent and diverse vertices of the network.

We embed the above random walk process into the inference procedures of PLSA and LDA, and propose two diverse topic models, DivPLSA and DivLDA. During inference, the topic assignment of each token is first computed by the EM algorithm (for PLSA) or Gibbs sampling (for LDA), and then a new topic assignment is obtained through the reinforced random walk.

4.2 Diverse Topic Models

4.2.1 DivPLSA

The PLSA model assumes each document is a mixture of topics with different proportions. Given a collection \mathcal{D}, the log-likelihood of \mathcal{D} under the PLSA assumption is calculated as

\mathcal{L}(\mathcal{D}) = \sum_{d \in \mathcal{D}} \sum_{w \in \mathcal{V}} c(w, d) \log \sum_{z=1}^{T_0} p(z \mid d)\, p(w \mid z),    (5)

where c(w, d) is the frequency of word w in document d, p(z \mid d) is the probability of topic z in d, p(w \mid z) is the probability of word w being generated by topic z, and T_0 is the number of topics specified. The parameters of the model, i.e., p(z \mid d) and p(w \mid z), are estimated by maximizing the log-likelihood. An EM algorithm is generally applied to solve this problem. In the E-step, it calculates the posterior distribution of the topic assignment of each token, i.e., p(z \mid d, w), based on the current model parameters. In the M-step, it updates the model parameters based on the posterior probabilities of topic assignments calculated in the E-step.

We embed the reinforced random walk process into the E-step. For each token, we first obtain the posterior distribution of topic assignments (Eqn. (6)). Then a one-step reinforced random walk on the topic network is conducted to determine whether the token is absorbed by another topic or stays on its current topic, which yields the new topic assignment distribution (Eqn. (7)). The transition probabilities among the topics are periodically updated in the M-step. We summarize the detailed updating equations below:

E-step:

p(z \mid d, w) = \frac{p(z \mid d)\, p(w \mid z)}{\sum_{z'} p(z' \mid d)\, p(w \mid z')},    (6)

p'(z \mid d, w) = \sum_{z'} p(z' \mid d, w)\, p(z' \to z).    (7)

M-step:

p(w \mid z) \propto \sum_{d} c(w, d)\, p'(z \mid d, w),    (8)

p(z \mid d) \propto \sum_{w} c(w, d)\, p'(z \mid d, w),    (9)

n_z = \sum_{d} \sum_{w} c(w, d)\, p'(z \mid d, w).    (10)

In Alg. 1, we present the detailed learning procedure of the DivPLSA model. Users still need to specify the starting number of topics T_0, which can simply be set to a large value, letting the data be over-fitted. During the learning process, the small topics will eventually be absorbed by the larger ones, i.e., their sizes drop to 0. We monitor the number of active topics (topics whose sizes are greater than 0) during the process; when this number converges, the whole algorithm stops.

Input: Training data \mathcal{D}, the starting number of topics T_0, the parameter \gamma controlling the reinforcement intensity by the topic size;
Output: Number of diverse topics, the word distributions of the topics;
initialization: randomly initialize the document-topic and topic-word distributions;
calculate the sizes of the topics and the transition probabilities among the topics according to (2);
while no convergence do
    E-step: for each word of each document in \mathcal{D}:
        - calculate the posterior probability of topic assignments according to (6);
        - conduct a one-step reinforced random walk on the topic network according to (7);
    M-step:
        - update the document-topic and topic-word distributions and the topic sizes according to (8), (9) and (10);
        - update the transition probabilities among the topics according to (2);
        - calculate the number of active topics;
end while
Algorithm 1: The DivPLSA model
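The sketch below translates Algorithm 1 into code under the same assumptions as the earlier snippets (soft posteriors in the E-step, the one-step walk applied as a matrix product over topics, multiplicative size reinforcement). It is a minimal, dense-matrix illustration of the procedure, not the authors' implementation, and its interface is hypothetical.

import numpy as np

def div_plsa(counts, T0=50, gamma=1.0, lam=0.1, burn_in=50, iters=200, seed=0):
    """Minimal DivPLSA sketch.  counts: (D, V) document-word count matrix."""
    rng = np.random.default_rng(seed)
    D, V = counts.shape
    theta = rng.dirichlet(np.ones(T0), size=D)           # p(z | d)
    phi = rng.dirichlet(np.ones(V), size=T0)              # p(w | z)
    sizes = np.full(T0, counts.sum() / T0)                 # topic sizes n_z

    for it in range(iters):
        # E-step, Eq. (6): posterior p(z | d, w) under the current parameters.
        post = theta[:, :, None] * phi[None, :, :]          # (D, T0, V)
        post /= post.sum(axis=1, keepdims=True) + 1e-12

        if it >= burn_in:
            # Reinforced transition matrix over topics (Eqs. (1)-(4)).
            norm = phi / (np.linalg.norm(phi, axis=1, keepdims=True) + 1e-12)
            w = norm @ norm.T
            np.fill_diagonal(w, 0.0)
            P = lam * w / (w.sum(axis=1, keepdims=True) + 1e-12)
            np.fill_diagonal(P, 1.0 - lam)
            P = P * np.power(sizes, gamma)[None, :]          # reinforce by target size
            P /= P.sum(axis=1, keepdims=True) + 1e-12
            # One-step walk, Eq. (7): redistribute posterior mass among topics.
            post = np.einsum('dkv,kj->djv', post, P)

        # M-step, Eqs. (8)-(10): re-estimate parameters from soft counts.
        weighted = post * counts[:, None, :]
        phi = weighted.sum(axis=0)
        sizes = phi.sum(axis=1)
        phi = phi / (sizes[:, None] + 1e-12)
        theta = weighted.sum(axis=2)
        theta /= theta.sum(axis=1, keepdims=True) + 1e-12

    active = int((sizes > 1e-8).sum())                       # active (non-empty) topics
    return theta, phi, sizes, active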

    4.2.2 DivLDA

The LDA model is a Bayesian treatment of PLSA that places Dirichlet priors on both the document-topic and topic-word distributions, which makes exact inference computationally intractable. The collapsed Gibbs sampling algorithm [8] is widely used for inference due to its simplicity and effectiveness. In Gibbs sampling, each dimension of the joint distribution is sampled alternately from the distribution conditioned on all the other variables. Specifically, in LDA the conditional distribution of the topic assignment z_{d,i} associated with the word w_{d,i} is calculated via

p(z_{d,i} = k \mid \mathbf{z}_{\neg(d,i)}, \mathbf{w}) \propto (n_{d,k}^{\neg(d,i)} + \alpha) \cdot \frac{n_{k, w_{d,i}}^{\neg(d,i)} + \beta}{n_k^{\neg(d,i)} + V\beta},    (11)

where \mathbf{z}_{\neg(d,i)} represents the topic assignments of all the words excluding the current word, n_{d,k} is the number of words assigned to topic k in document d, n_{k,w} is the number of times that word w is assigned to topic k, and n_k is the number of tokens assigned to topic k, i.e., the size of topic k. For all these counts, the current token is excluded.

In DivLDA, instead of calculating the expectation of the topic assignments as done in DivPLSA, sampling is used to obtain the topic assignments. For each token, we first sample the topic assignment z_{d,i} based on Equation (11), and then conduct a one-step reinforced random walk starting from z_{d,i} to obtain a new topic assignment z'_{d,i} according to

p(z'_{d,i} = k) = p(z_{d,i} \to z_k).    (12)

The final document-topic and topic-word distributions can be estimated via

\theta_{d,k} = \frac{n_{d,k} + \alpha}{\sum_{k'} (n_{d,k'} + \alpha)},    (13)

\phi_{k,w} = \frac{n_{k,w} + \beta}{n_k + V\beta}.    (14)

    The detailed inference process is summarized in Alg. 2.

Input: Training data \mathcal{D}, the starting number of topics T_0, the parameter \gamma controlling the reinforcement intensity by the topic size, the total number of Gibbs sampling iterations M;
Output: Number of diverse topics, the word distributions of the topics;
initialization: randomly sample the topic assignments of all the word tokens;
calculate the sizes of the topics and the transition probabilities among the topics according to (2);
while iter < M do
    for each token of each document in \mathcal{D}:
        - for the current assignment of the token, decrement the counts and sums n_{d,k}, n_{k,w}, n_k;
        - sample a topic assignment for the token according to (11);
        - conduct a one-step reinforced random walk on the topic network according to (12);
        - for the latest assignment, increment the counts and sums n_{d,k}, n_{k,w}, n_k;
    update the transition probabilities among the topics according to (2);
    optimize the Dirichlet hyper-parameter according to [5];
end while
Algorithm 2: The DivLDA model
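As a rough sketch of the per-token update in Algorithm 2, the function below combines the standard collapsed Gibbs step with a one-step walk over the reinforced topic transition matrix P; the interface and the way P would be refreshed between sweeps are illustrative assumptions, not the authors' code.

import numpy as np

def divlda_token_update(rng, d, w, z_old, n_dk, n_kw, n_k, alpha, beta, P):
    """One DivLDA update for a single token (doc d, word id w, old topic z_old).

    n_dk: (D, T) doc-topic counts; n_kw: (T, V) topic-word counts; n_k: (T,) topic sizes.
    P:    (T, T) reinforced topic transition matrix (Eqs. (2)-(4)).
    """
    V = n_kw.shape[1]
    # Remove the current token from all counts.
    n_dk[d, z_old] -= 1; n_kw[z_old, w] -= 1; n_k[z_old] -= 1
    # Collapsed Gibbs conditional, Eq. (11).
    p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
    z = rng.choice(len(p), p=p / p.sum())
    # One-step reinforced random walk from the sampled topic, Eq. (12).
    z_new = rng.choice(P.shape[0], p=P[z])
    # Add the token back under its new assignment.
    n_dk[d, z_new] += 1; n_kw[z_new, w] += 1; n_k[z_new] += 1
    return z_new

In a full sampler, P would be recomputed periodically from the current topic-word counts and topic sizes according to (2), analogously to the M-step of DivPLSA.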

    4.3 Discussion

    We discuss some practical issues of the two models.

Convergence. In the above two models, after plugging the reinforced random walk into the EM or Gibbs sampling process, there is no longer an explicit objective function. We empirically show that both the number of active topics and the data likelihood converge (see Figures 2(d) and 2(h)). We leave the theoretical justification of this convergence as future work.

How to set the parameter \lambda? The parameter \lambda controls the probability of transiting to neighboring topics versus staying on the current topic. It has an effect similar to the step size in gradient descent, and hence in practice a small value (e.g., 0.1) can be used.

How to set the parameters \gamma and T_0? We empirically show that the performance of the two models is not sensitive to \gamma or T_0 (see Figures 5 and 6). In practice, T_0 can be set to a very large value so that the data is over-fitted, while the suitable range of \gamma differs slightly between DivPLSA and DivLDA (see Section 5.6).

Scalability. Both DivPLSA and DivLDA can be easily scaled by making use of existing large-scale topic modeling techniques. The E-step of DivPLSA can be parallelized by assigning the documents to different processors or nodes. A scalable version of DivLDA can be built on top of existing large-scale LDA implementations, e.g., the Yahoo-LDA model in [2].

    5 Experiment

In this section, we evaluate the effectiveness of our proposed diverse topic models on four real-world data sets.

    5.1 Datasets

    4CONF. We start with a small data set, the 4CONF data set as in [16]. The data set is constructed from papers published in four conferences including KDD, SIGIR, NIPS, and WWW. Every document corresponds to an author by aggregating the titles of the author’s papers. Stop words and words appearing in less than 10 documents are removed. This small data set allows us to interpret the topics intuitively and visually.

    20NG. This is the widely used 20 newsgroup data set in text mining. Stop words and words appearing in less than 20 documents are removed. We sample 1,000 documents from the set as a holdout data set.

    WIKIPEDIA. This includes 10,000 articles randomly sampled from 4,636,797 Wikipedia articles in English. Stop words and words appearing in less than 100 documents are removed. We also hold out a sample of 1,000 articles.

    DBLP. The larger data set consists of all papers with abstracts in the computer science bibliography as in [23]. Stop words and words appearing in less than 50 documents are removed.

    Table 1 summarizes the statistics of all the data sets.

    Dataset # train # holdout vocabulary # tokens
    4CONF 8,486 1,672 80,642
    20NG 11,267 1,000 7,642 1,056,012
    WIKIPEDIA 10,000 1,000 196,665 3,006,817
    DBLP 529,434 25,404 36,899,908
    Table 1: Statistics of the data sets

    5.2 Evaluation Metrics

    To provide a good summary of the data, the topics need to be highly coherent and also provide a high coverage of the information in the original data. We introduce metrics to evaluate the semantic coherence and the coverage of the topics respectively. The quality of a topic is measured through the semantic coherence of the word distribution while the information coverage of a set of topics is measured through their predictive performance on the holdout data set, using the well-adopted perplexity metric.

Topic Semantic Coherence. We measure the semantic quality of the topics through their semantic coherence. In [19], Newman et al. measure the semantic coherence of a topic as the average point-wise mutual information (PMI) of every word pair among the top-ranked words of the topic. Specifically, the overall semantic coherence of a set of K topics is calculated as

\text{Coherence} = \frac{1}{K} \sum_{k=1}^{K} \frac{2}{N(N-1)} \sum_{1 \le i < j \le N} \log \frac{p(w_i^{(k)}, w_j^{(k)})}{p(w_i^{(k)})\, p(w_j^{(k)})},    (15)

where w_1^{(k)}, \dots, w_N^{(k)} are the words ranked at the top N positions in topic k, p(w_i, w_j) is the probability that the pair of words co-occur in the same document, and p(w_i) is the probability of a word appearing in a document. The top-ranked 20 words (N = 20) are used in our experiments.

    To calculate the PMI, generally a large dataset has to be used. The entire 20NG data set (12,267 documents) and the entire English Wikipedia (4.6 million documents) are used in calculating the PMI for the experiments on the 20NG data set and on the WIKIPEDIA data set, respectively.
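A small sketch of this coherence computation, assuming document-frequency estimates of the probabilities and simple averaging over topics as in Eq. (15); the inputs doc_freq and co_doc_freq are hypothetical pre-computed document (co-)occurrence counts.

import itertools
import math

def topic_coherence(top_words, doc_freq, co_doc_freq, num_docs):
    """Average PMI over all pairs of a topic's top-ranked words (Eq. (15)).

    top_words:   list of the top-N words of one topic.
    doc_freq:    dict word -> number of documents containing the word.
    co_doc_freq: dict (w1, w2) -> number of documents containing both words.
    """
    pmis = []
    for w1, w2 in itertools.combinations(top_words, 2):
        p1 = doc_freq.get(w1, 0) / num_docs
        p2 = doc_freq.get(w2, 0) / num_docs
        p12 = co_doc_freq.get((w1, w2), co_doc_freq.get((w2, w1), 0)) / num_docs
        if p1 > 0 and p2 > 0 and p12 > 0:      # skip unseen pairs (no smoothing here)
            pmis.append(math.log(p12 / (p1 * p2)))
    return sum(pmis) / len(pmis) if pmis else 0.0

def coherence_of_topic_set(topic_top_words, doc_freq, co_doc_freq, num_docs):
    """Overall coherence of a set of topics: the mean of the per-topic scores."""
    scores = [topic_coherence(tw, doc_freq, co_doc_freq, num_docs)
              for tw in topic_top_words]
    return sum(scores) / len(scores)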

Perplexity. We measure the information coverage of a set of topics by perplexity, which measures the predictive performance of these topics on the holdout data set. Specifically, each document d in the holdout set is split into two parts. The likelihood of the second part of the document is calculated based on the topics learned from the training data and the words in the first part of the document. Specifically, we have

\text{perplexity} = \exp\left( - \frac{\sum_{d} \log p(\mathbf{w}_d^{(2)} \mid \mathbf{w}_d^{(1)}, \mathcal{D}_{\text{train}})}{\sum_{d} |\mathbf{w}_d^{(2)}|} \right),    (16)

where \mathbf{w}_d^{(1)} and \mathbf{w}_d^{(2)} denote the first and second parts of document d, respectively.
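The sketch below computes this document-completion perplexity, assuming the learned topic-word distributions are held fixed and the document-topic proportions of each held-out document are re-estimated from its first half by a few folding-in EM steps (a common evaluation choice; the exact folding-in procedure is an assumption here).

import numpy as np

def heldout_perplexity(phi, heldout_docs, em_steps=20):
    """Document-completion perplexity (Eq. (16)).

    phi:          (T, V) topic-word distributions learned on the training data.
    heldout_docs: list of (first_half_ids, second_half_ids) word-id lists per document.
    """
    T = phi.shape[0]
    total_log_lik, total_tokens = 0.0, 0
    for first, second in heldout_docs:
        theta = np.full(T, 1.0 / T)                 # folded-in p(z | d)
        for _ in range(em_steps):                   # EM on the observed first half
            post = theta[:, None] * phi[:, first]   # (T, |first|)
            post /= post.sum(axis=0, keepdims=True) + 1e-12
            theta = post.sum(axis=1)
            theta /= theta.sum()
        pred = theta @ phi[:, second]               # predicted p(w) for unseen words
        total_log_lik += np.log(pred + 1e-12).sum()
        total_tokens += len(second)
    return float(np.exp(-total_log_lik / total_tokens))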

    5.3 Algorithms for Comparison

    We compare the following algorithms for selecting top-K topics for data summarization.

    • PLSA/LDA. The classical PLSA or LDA model is directly utilized to learn exactly K topics.

    • PLSA/LDA-TopK. We first train PLSA/LDA with a large number of topics, and then pick top-K topics with the largest sizes.

• PLSA/LDA-MMR-TopK. PLSA/LDA is used to train a large number of topics; the MMR algorithm [6] is then used to select the top-K topics among them. Although the MMR algorithm was proposed in a query-dependent setting, we adapt it to our scenario: the relevance between the "query" and each topic is measured as the coverage of that topic in the whole data set, and the similarity between each pair of topics is calculated as the cosine similarity of their word distributions. The best results are reported after empirically tuning the trade-off parameter of the MMR algorithm.

• PLSA/LDA-DivRank-TopK. PLSA/LDA is used to train a large number of topics, and the DivRank algorithm [17] is then used to select the top-K topics. In the DivRank algorithm, we treat the proportions of the topics as the preference vector, and the weight between each pair of topics is calculated as the cosine similarity of the corresponding word distributions. The best results are reported after empirically tuning the parameters of the DivRank algorithm.

    • DivPLSA/DivLDA-TopK. DivPLSA/DivLDA is applied on the data set and then the top-K topics with the largest sizes are selected.

Figure 2: The learning behavior of DivPLSA on the 4CONF dataset. Panels: (a) Iteration 1; (b) Iteration 20; (c) Iteration 50; (d) #topics vs. #iterations; (e) Iteration 70; (f) Iteration 90; (g) Iteration 200; (h) likelihood vs. #iterations.

    5.4 Learning Behavior of Diverse Topic Models

We start the experimental results by investigating the learning behavior of our proposed diverse topic models, taking the DivPLSA model as an example; the DivLDA model shows similar behavior and is omitted here. In Figure 2, we show how the topic assignments of the documents and words change along the iterations on the 4CONF dataset. We first build a document-word bipartite network whose edge weights are the word frequencies in the documents. The DrL layout algorithm [15] is applied to this bipartite network to calculate two-dimensional coordinates for both the documents and the words. In Figure 2(a), each data point represents a document, and the 100 most popular words in the dataset are shown. Different colors represent different topics, and each document or word is assigned to its most probable topic. The size of each word is determined by the probability of the word in its assigned topic. We expect documents and words belonging to the same topic (color) to lie within a dense area.

We start the DivPLSA model with 20 topics. In the first 50 iterations, the standard PLSA EM updates are performed, and the reinforced random walk process kicks in after the 50th iteration. In the first EM iteration (Figure 2(a)), the topic assignments of all the tokens in the data set are randomly initialized, and the colors in the entire plot are totally mixed: there is no dense area with a single color. As time goes on, dense areas with the same color gradually emerge, e.g., at the 20th iteration (Figure 2(b)). The results become better by the 50th iteration, where we can see the emergence of topics such as "information retrieval" and "Web semantic." However, too many topics are fitted to the data, and many of them are small and similar to each other; no clear topical structure is discovered. After the 50th iteration, the reinforced random walk process kicks in and the smaller topics are gradually absorbed by the larger and similar ones. We can see a better topical structure at the 70th iteration, at which fewer than 20 topics are active. At the 200th iteration, the whole procedure stops and five active topics remain (the other 15 topics have been fully absorbed), including "information retrieval," "web," "data mining," "neural network," and "learning, Bayesian." Recalling that the dataset is collected from the four conferences SIGIR, WWW, KDD, and NIPS, the five topics form a good summary of the original data. It is also interesting to notice that two distinct topics, "neural network" and "learning, Bayesian," are discovered within the NIPS papers, which indicates that there are two distinctive research communities in the machine learning area. The final list of topics is presented in Table 2.

    Top-ranked Words of Each Topic
    learning classification models clustering algorithm support bayesian
    retrieval information text query system document evaluation
    neural networks network model recognition visual learning
    web search semantic xml extraction analysis content
    data mining knowledge discovery large databases rules
    Table 2: Topics extracted by DivPLSA

In Figure 2(d), we show how the number of active topics changes along the iterations. In the first 50 iterations, the number of active topics remains 20, as the random walk process has not kicked in yet. The number of active topics begins to decrease at around the 70th iteration, as the smaller topics are absorbed by the larger and similar ones, and it drops sharply to 5 at around the 90th iteration, after which it converges. We can see that the absorbing process converges quite quickly, taking around 20 EM iterations.

Figure 2(h) shows how the likelihood of the training data changes over the iterations. The likelihood keeps increasing at first and starts to decrease at around the 70th iteration, as the number of active topics decreases. At around the 90th iteration, the likelihood stops decreasing as the number of active topics converges. After this, the likelihood increases again and converges at around the 150th iteration.

Overall, we can see that when the data is over-fitted, the topics inferred by the classical topic models tend to be small and duplicated, which is not good for data summarization purposes. The reinforced random walk is able to merge the similar topics and ends up with a few large, diverse topics. Both the number of topics and the likelihood of the data finally converge.

    (a) Perplexity (b) Semantic Coherence
    Algorithm Top-2 Topics PMI Proportion
    PLSA people writes article don god time good apr 0.499 58.63%
    writes system article file mail don windows key 0.449 41.37%
    PLSA-TopK god jesus church christ bible lord man faith 1.845 2.98%
    server mit sun motif source version library tar 1.905 2.95%
    DivPLSA-TopK god people jesus christian bible don life time 1.177 12.48%
    file program window files ftp image version server 1.325 12.19%
    (c) Comparison of the topics learned by different methods.
    Figure 3: Performances on the 20NG dataset.
    (a) Perplexity (b) Semantic Coherence
    Algorithm Top-2 Topics PMI Proportion
    PLSA user overlap university world school calculated time war 0.295 52.70%
    template article page wikipedia made delete image user 0.862 47.30%
    PLSA-TopK album music song band released rock records single 1.590 2.57%
    wikipedia http php user page article talk edit 1.280 2.40%
    DivPLSA-TopK film series show episode time man television movie 1.051 8.51%
    music album song band released single rock songs 1.563 5.23%
    (c) Comparison of the topics learned by different methods.
    Figure 4: Performances on the WIKIPEDIA dataset.

    5.5 Evaluation for Data Summarization

Next, we evaluate the performance of the different algorithms on the task of topic summarization, which aims to learn and select top-K topics for summarizing the data. The performance is evaluated in terms of the semantic coherence and information coverage of the topics, measured by PMI and perplexity respectively.

The results on the 20NG dataset are presented in Figure 3. Figure 3(a) presents the information coverage of the topics learned by the various algorithms in terms of perplexity; the lower the perplexity, the larger the information coverage. Overall, the more topics are selected, the higher the information coverage. All the methods except PLSA/LDA are trained with 50 topics (the starting number of topics is also 50 for DivPLSA/DivLDA). The PLSA-TopK algorithm, which selects the largest top-K topics among the 50 topics trained by the PLSA model, achieves the worst performance: the data is over-fitted and the inferred topics are small and similar to each other, and thus have a low information coverage of the data. Both PLSA-MMR-TopK and PLSA-DivRank-TopK outperform PLSA-TopK by selecting large but non-redundant top-K topics. LDA-TopK achieves better performance than PLSA-TopK, which may be because the largest topics learned by LDA are larger than those learned by PLSA (as verified by examining the sizes of the largest topics learned by the two models). Similarly, LDA-MMR-TopK and LDA-DivRank-TopK further improve the information coverage by taking the diversity among the selected topics into consideration.

As there are 20 categories in the 20NG dataset, it is reasonable to assume there are 20 topics in the dataset. Therefore, we also trained PLSA/LDA with 20 topics and selected the topics with the largest sizes (marked as PLSA/LDA-20-TopK). We can see that the topics selected from 20 topics (PLSA/LDA-20-TopK) have a higher information coverage than those selected from 50 topics (PLSA/LDA-TopK). This indicates that if we have a good estimate of the appropriate number of topics in the data, we are likely to infer better topics with a higher information coverage. Our DivPLSA and DivLDA models (also starting from 50 topics) outperform all these models by learning the most prominent and diverse topics, without worrying about estimating the appropriate number of topics: although they start from a large number of topics, DivPLSA/DivLDA are able to merge the similar topics and end up with the diverse ones. We are also surprised to see that deploying PLSA/LDA to train just K topics (denoted as PLSA/LDA) obtains the best performance in terms of information coverage. However, although these topics have a high information coverage in terms of perplexity, their semantic coherence is quite low and the topics are not interpretable to users.

In Figure 3(b), we compare the semantic coherence of the topics inferred by the different algorithms. Though the topics directly learned by PLSA/LDA have a high information coverage, their semantic coherence is the worst. This is because the data is under-fitted and each inferred topic is a mixture of multiple topics; such topics are more likely to be background topics of the whole data collection (see the first row in Figure 3(c)) and do not carry specific semantic information. The semantic quality of the top-K topics selected from the 50 topics learned by PLSA in various ways (PLSA-TopK, PLSA-MMR-TopK, PLSA-DivRank-TopK) is the best among all the models. This may be because the granularity of these topics is quite small (topics with a small granularity tend to yield a larger PMI), which can be examined from the second row in Figure 3(c). The topics learned by DivPLSA and DivLDA have reasonably good semantic quality and at the same time a high information coverage, both of which are desirable for a good summarization.

Figure 4 compares the performance on the WIKIPEDIA data set. To select an appropriate number of topics for this data set, we vary the number of topics and choose the best one (100) based on the predictive performance on the holdout data set. All the models (except PLSA and LDA directly trained with a few topics) are trained with 100 topics. The different models show behavior on this data set similar to that on the 20NG data set.

To summarize, our DivPLSA and DivLDA models give a good summarization of the data by presenting top-K topics with a high information coverage and reasonably good semantic quality. The K topics directly trained with PLSA/LDA are not interpretable and convey little useful semantic information to users. The top-K topics selected from a large number of topics trained by PLSA/LDA have good semantic quality but low information coverage.

    5.6 Parameter Sensitivity

Figure 5: Parameter sensitivity w.r.t. the reinforcement intensity \gamma on the 20NG dataset. Panels: (a) Perplexity (DivPLSA); (b) Semantic Coherence (DivPLSA); (c) Perplexity (DivLDA); (d) Semantic Coherence (DivLDA).
Figure 6: Parameter sensitivity w.r.t. the starting number of topics T_0 on the 20NG dataset. Panels: (a) Perplexity (DivPLSA); (b) Semantic Coherence (DivPLSA); (c) Perplexity (DivLDA); (d) Semantic Coherence (DivLDA).

In this part, we investigate the sensitivity of DivPLSA and DivLDA for data summarization w.r.t. the parameters \gamma and T_0. The parameter \gamma controls the intensity of reinforcement by topic size on the transition probabilities among the topics. The larger \gamma is, the more likely the larger topics are to absorb the smaller topics. Therefore, the model returns fewer but larger topics with a larger \gamma, which tend to have a higher information coverage and lower semantic coherence. This can be observed from the results shown in Figure 5, which presents the performance w.r.t. \gamma for both DivPLSA and DivLDA. Overall, the performance is not sensitive to \gamma. Note that DivLDA tends to use a smaller \gamma than DivPLSA; this may be due to the Gibbs sampling algorithm and the use of the Dirichlet hyper-parameter, which is periodically optimized.

The sensitivity w.r.t. the starting number of topics T_0, for both DivPLSA and DivLDA in terms of perplexity and semantic coherence, is presented in Figure 6. We can see that the performance is also not sensitive to T_0.

    5.7 Summarization for DBLP

Finally, we provide a summarization of the entire DBLP dataset with our diverse topic models. The DivPLSA model is applied to the dataset and ends up with 36 topics. Table 3 shows the 10 most prominent topics in the dataset.

    Human Label Top-ranked Words Proportion
    “methodology” method proposed algorithm results approach based paper methods 7.69%
    “empirical studies” study analysis results evaluation studies performance quality case 4.55%
    “algorithm complexity” number algorithm time complexity bound show bounds polynomial 4.17%
    “knowledge representation” model models framework knowledge modeling approach domain representation 3.98%
    “computer-assisted learning” computer learning research students project technology science social 3.81%
    “machine learning” learning classification model clustering neural training models statistical 3.65%
    “parallel computing” performance memory parallel hardware implementation applications processor processors 3.63%
    “virtual systems” system user users virtual interface interaction environment information 3.46%
    Table 3: Topic summarization for the DBLP dataset.

    6 Conclusion and Future Work

In this paper, we proposed two diverse topic models, DivPLSA and DivLDA, to learn prominent and diverse topics for data summarization. The two models are built on top of a reinforced random walk on a topic network, which allows the prominent topics to absorb tokens from smaller and similar topics and improves the diversity among the extracted topics. The inference procedures of the two models remain as simple and efficient as those of the classical models and are appropriate for big data analysis. Experiments on four real-world datasets demonstrate the effectiveness of the two models for data summarization.

Future work is two-fold. First, we plan to investigate the theoretical convergence of the two diverse topic models. Currently, the convergence of the two models is demonstrated empirically through the likelihood of the training data and the number of active topics; we believe there is an underlying objective function that trades off the data likelihood against the diversity among the topics. Second, we plan to apply the reinforced random walk to more scenarios, such as unsupervised clustering, which would yield prominent and diverse clusters in the data.

    References

    • [1] R. Agrawal, S. Gollapudi, A. Halverson, and S. Ieong. Diversifying search results. In Proceedings of the Second ACM International Conference on Web Search and Data Mining, pages 5–14. ACM, 2009.
    • [2] A. Ahmed, M. Aly, J. Gonzalez, S. Narayanamurthy, and A. J. Smola. Scalable inference in latent variable models. In Proceedings of the fifth ACM international conference on Web search and data mining, pages 123–132. ACM, 2012.
    • [3] A. Ahmed, Y. Low, M. Aly, V. Josifovski, and A. J. Smola. Scalable distributed inference of dynamic user interests for behavioral targeting. In KDD, pages 114–122, 2011.
    • [4] D. M. Blei. Probabilistic topic models. Communications of the ACM, 55(4):77–84, 2012.
    • [5] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation. the Journal of machine Learning research, 3:993–1022, 2003.
    • [6] J. Carbonell and J. Goldstein. The use of mmr, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, pages 335–336. ACM, 1998.
    • [7] G. Erkan and D. R. Radev. Lexrank: Graph-based lexical centrality as salience in text summarization. J. Artif. Intell. Res.(JAIR), 22(1):457–479, 2004.
    • [8] T. L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National academy of Sciences of the United States of America, 101(Suppl 1):5228–5235, 2004.
    • [9] T. Hofmann. Probabilistic latent semantic analysis. In

      Proceedings of the Fifteenth conference on Uncertainty in artificial intelligence

      , pages 289–296. Morgan Kaufmann Publishers Inc., 1999.
    • [10] A. Karandikar. Clustering short status messages: A topic model based approach. PhD thesis, University of Maryland, 2010.
    • [11] S. Lacoste-Julien, F. Sha, and M. I. Jordan. Disclda: Discriminative learning for dimensionality reduction and classification. In Advances in neural information processing systems, pages 897–904, 2008.
    • [12] W. Li and A. McCallum. Pachinko allocation: Dag-structured mixture models of topic correlations. In Proceedings of the 23rd international conference on Machine learning, pages 577–584. ACM, 2006.
    • [13] C. Lin and Y. He. Joint sentiment/topic model for sentiment analysis. In Proceedings of the 18th ACM conference on Information and knowledge management, pages 375–384. ACM, 2009.
    • [14] H. Ma, M. R. Lyu, and I. King. Diversifying query suggestion results. In Proc. of AAAI, volume 10, 2010.
    • [15] S. Martin, W. Brown, R. Klavans, and K. Boyack. Drl: Distributed recursive (graph) layout. SAND2008-2936J: Sandia National Laboratories, 2008.
    • [16] Q. Mei, D. Cai, D. Zhang, and C. Zhai. Topic modeling with network regularization. In Proceedings of the 17th international conference on World Wide Web, pages 101–110. ACM, 2008.
    • [17] Q. Mei, J. Guo, and D. Radev. Divrank: the interplay of prestige and diversity in information networks. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1009–1018. ACM, 2010.
    • [18] Q. Mei, X. Ling, M. Wondra, H. Su, and C. Zhai. Topic sentiment mixture: modeling facets and opinions in weblogs. In Proceedings of the 16th international conference on World Wide Web, pages 171–180. ACM, 2007.
    • [19] D. Newman, J. H. Lau, K. Grieser, and T. Baldwin. Automatic evaluation of topic coherence. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 100–108. Association for Computational Linguistics, 2010.
    • [20] R. Pemantle. Vertex-reinforced random walk. Probability Theory and Related Fields, 92(1):117–136, 1992.
    • [21] F. Petralia, V. Rao, and D. B. Dunson. Repulsive mixtures. In NIPS, pages 1898–1906, 2012.
    • [22] D. Ramage, P. Heymann, C. D. Manning, and H. Garcia-Molina. Clustering the tagged web. In Proceedings of the Second ACM International Conference on Web Search and Data Mining, pages 54–63. ACM, 2009.
    • [23] J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang, and Z. Su. Arnetminer: Extraction and mining of academic social networks. In KDD’08, pages 990–998, 2008.
    • [24] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical dirichlet processes. Journal of the american statistical association, 101(476), 2006.
    • [25] J. Zhu, A. Ahmed, and E. P. Xing. Medlda: maximum margin supervised topic models for regression and classification. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1257–1264. ACM, 2009.
    • [26] X. Zhu, A. B. Goldberg, J. Van Gael, and D. Andrzejewski. Improving diversity in ranking using absorbing random walks. In HLT-NAACL, pages 97–104, 2007.
    • [27] C.-N. Ziegler, S. M. McNee, J. A. Konstan, and G. Lausen. Improving recommendation lists through topic diversification. In Proceedings of the 14th international conference on World Wide Web, pages 22–32. ACM, 2005.
    • [28] J. Zou and R. Adams. Priors for diversity in generative latent variable models. In Advances in Neural Information Processing Systems 25, pages 3005–3013, 2012.