A Toolkit for Industrial Topic Modeling
Familia is an open-source toolkit for pragmatic topic modeling in industry. Familia abstracts the utilities of topic modeling in industry as two paradigms: semantic representation and semantic matching. Efficient implementations of the two paradigms are made publicly available for the first time. Furthermore, we provide off-the-shelf topic models trained on large-scale industrial corpora, including Latent Dirichlet Allocation (LDA), SentenceLDA and Topical Word Embedding (TWE). We further describe typical applications that are successfully powered by topic modeling, in order to ease the confusion and difficulty that software engineers face during topic model selection and utilization.
Topic modeling is a well-recognized approach for organizing, searching and understanding vast amounts of documents (Blei, 2012). In the last decade, various topic models have been proposed in academia. However, only a very small portion of these models have been applied in industry. The major problems that hinder the wide adoption of topic modeling in industry are essentially twofold. First, many research and engineering endeavors are devoted to Probabilistic Latent Semantic Analysis (PLSA) (Hofmann, 1999) and its fully Bayesian counterpart Latent Dirichlet Allocation (LDA) (Blei et al., 2003), while the vast majority of the other topic models lack open-source implementations. Second, most existing research and engineering work focuses on “designing new topic models” rather than “applying topic modeling results in real-life applications”. As a result, there is little work discussing how to properly utilize topic modeling results in real-life scenarios. Given the absence of a comprehensive guide, it is often difficult for developers to select an appropriate topic model for their tasks and to apply the model in a proper way.
As one step toward bridging the gap between topic modeling research and industrial applications, we open-source a topic modeling toolkit named Familia (https://github.com/baidu/Familia). Three off-the-shelf topic models are readily provided in Familia: LDA, SentenceLDA (Jo and Oh, 2011) and TWE (Liu et al., 2015), all trained on large-scale industrial corpora. Developers thus have the freedom to explore topic models beyond LDA. For effective usage in industrial applications, efficient implementations of two utility paradigms (i.e., semantic representation and semantic matching) have been made publicly available in Familia. We further discuss several industrial cases that benefit from topic modeling, aiming to ease the confusion and difficulty that developers face during topic model selection and application. The rest of this paper is organized as follows. We detail the open-sourced content of Familia in Section 2. Then we discuss several industrial cases in Section 3. Finally, we conclude the paper in Section 4.
Familia is composed of two major components: source code for two industrial utility paradigms and the trained topic models.
The industrial utilities of topic models can be broadly divided into two categories: semantic representation and semantic matching. For semantic representation, Familia provides two kinds of Markov chain Monte Carlo (MCMC) algorithms for users to investigate and choose from: Gibbs sampling (Griffiths and Steyvers, 2004) and Metropolis-Hastings (Yuan et al., 2015). For semantic matching, Familia contains functions for calculating the semantic similarity between texts of different lengths: short-long text matching and long-long text matching. Moreover, in order to better display the details of trained topic models, Familia also provides functions such as nearest-word querying and topic-word querying.
Three off-the-shelf topic models of high industrial value are publicly released in Familia. Each of them is trained on large-scale industrial corpora. The characteristics of the three models are briefly summarized as follows:
LDA: Each document is represented as a mixture over latent topics and each topic is modeled as a distribution over words.
SentenceLDA: SentenceLDA assumes that the words within one sentence are generated by the same topic. It models the co-occurrence of words at a finer granularity than LDA.
TWE: TWE utilizes LDA topics as complementary information for training word embeddings. Hence, TWE provides both topic embeddings and word embeddings. Topics derived by LDA are often dominated by high-frequency words; the embeddings of TWE partially alleviate this problem by capturing the semantics of low-frequency words under each topic.
In this section, we discuss several industrial cases that benefit from the aforementioned paradigms and models. This section serves as a guide for developers to select appropriate topic models for their tasks and to apply them properly.
We first discuss some cases involving semantic representation. The semantic representation derived by topic modeling typically works as features for other machine learning models.
The first case is the classification of news articles. In a news feed service, the articles collected from various sources often include low-quality ones. To improve the user experience, we need a classifier that distinguishes the good articles from the bad ones. Conventionally, the classifier is built upon handcrafted features such as the source site, the text length and the total number of images. We can employ a topic model to obtain the topical distribution of each article and augment the handcrafted features with this distribution (shown in Figure 1(a)). As an experiment, we prepare 7,000 news articles, manually labeled into 5 categories, in which 0 stands for the lowest quality and 4 for the best. We train a Gradient Boosting Decision Tree (GBDT) on 5,000 articles with different features and test the trained classifier on the other 2,000 articles. Figure 1(b) shows the results of the two classifiers using different feature sets: baseline and baseline+LDA. The results with topic model features are significantly better, showing that topic modeling is an effective way of representing documents.
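The feature augmentation step can be sketched as a simple concatenation. This is a minimal illustration, not Familia's API: the function name `augment_features`, the three handcrafted features and the 4-topic distribution are all hypothetical, and the resulting vector would be passed to whatever GBDT implementation is in use.

```python
def augment_features(handcrafted, topic_distribution):
    """Concatenate handcrafted features with a document's topic distribution."""
    # A topic distribution should be a probability vector; fail early otherwise.
    if abs(sum(topic_distribution) - 1.0) > 1e-6:
        raise ValueError("topic distribution must sum to 1")
    return list(handcrafted) + list(topic_distribution)


# Hypothetical article: [source-site id, text length, number of images]
handcrafted = [3.0, 1250.0, 4.0]
# Hypothetical 4-topic distribution inferred for the same article
topics = [0.70, 0.20, 0.05, 0.05]

features = augment_features(handcrafted, topics)
```

With this layout the classifier sees one extra feature per topic, so the topic count directly sets how many columns are appended.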
Straightforwardly, the semantic representation of documents can also be used for clustering. In the task of clustering news articles, we use LDA to compute the topic distribution of each article and cluster the articles with K-means. Figure 2 shows two clusters obtained by clustering 1,000 articles into ten groups. Cluster 1 consists of articles related to interior design and Cluster 2 contains articles about the stock market. The result shows that news articles can be semantically clustered based on their topic distributions.
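As a rough sketch of this step, the snippet below runs plain Lloyd-style K-means over toy topic distributions. The data, the deterministic initialization (the first k points) and the use of squared Euclidean distance are assumptions made for illustration; the source does not state which K-means variant or distance was used.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; centroids start at the first k points."""
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids


# Toy 3-topic distributions for four hypothetical articles
docs = [
    [0.90, 0.05, 0.05],
    [0.05, 0.90, 0.05],
    [0.80, 0.10, 0.10],
    [0.10, 0.85, 0.05],
]
labels, centers = kmeans(docs, k=2)
```

Articles 0 and 2 are dominated by the first topic and articles 1 and 3 by the second, so they fall into two separate clusters.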
In text mining, we frequently need to evaluate the information richness of documents. This requirement can be partially met by the topic distributions of documents. We first calculate the topical distribution of the document by topic modeling, and then calculate the entropy of the topical distribution as follows:
$$H(d) = -\sum_{k=1}^{K} p_k \log p_k$$
where $d$ represents the document, $K$ represents the total number of topics, and $p_k$ represents the probability of the $k$-th topic in $d$. The higher the entropy, the richer the document content. In the scenario of web information retrieval, the richness value is utilized as a feature in sophisticated ranking models.
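The richness score can be computed in a few lines. This sketch assumes the standard Shannon entropy with the natural logarithm and skips zero-probability topics (treating 0 log 0 as 0); the function name and the two toy distributions are illustrative only.

```python
import math


def topic_entropy(topic_distribution):
    """Shannon entropy (in nats) of a document's topic distribution."""
    # Topics with zero probability contribute nothing to the sum.
    return -sum(p * math.log(p) for p in topic_distribution if p > 0.0)


focused = [0.97, 0.01, 0.01, 0.01]   # document dominated by one topic
broad = [0.25, 0.25, 0.25, 0.25]     # document spread evenly over topics

focused_richness = topic_entropy(focused)
broad_richness = topic_entropy(broad)
```

The evenly spread document reaches the maximum value log K, while the focused one scores close to zero.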
Another paradigm is semantic matching, which can be further categorized as short-short text matching, short-long text matching and long-long text matching.
The need for short-short text matching is common in web search, where we need to compute the semantic similarity between queries and web page titles. Due to the difficulty of topic modeling on short text, embedding-based models such as Word2Vec and TWE are much more common for this task. Suppose we want to compute the semantic similarity between the query “recommend good movies” and the web page title “2016 good movies in China”. We first convert the query and the title into their embeddings (denoted here $\mathbf{v}_q$ and $\mathbf{v}_t$) and then compute the cosine similarity between these embeddings.
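A minimal sketch of this matching step follows. The source does not say how word vectors are combined into a text embedding; averaging them is an assumption made here, and the 3-dimensional vectors in `toy_embeddings` are invented for illustration rather than taken from a trained Word2Vec or TWE model.

```python
import math


def average_embedding(tokens, embeddings):
    """Average the vectors of the tokens that have an embedding (assumed pooling)."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        raise ValueError("no token has an embedding")
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Invented toy embeddings; "in" is deliberately out of vocabulary.
toy_embeddings = {
    "recommend": [0.2, 0.1, 0.7],
    "good": [0.6, 0.3, 0.1],
    "movies": [0.9, 0.1, 0.0],
    "2016": [0.1, 0.8, 0.1],
    "china": [0.0, 0.9, 0.1],
}
query = ["recommend", "good", "movies"]
title = ["2016", "good", "movies", "in", "china"]

similarity = cosine_similarity(
    average_embedding(query, toy_embeddings),
    average_embedding(title, toy_embeddings),
)
```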
In many online applications, we need to compute the semantic similarity between a query and a document. Since the query is typically short and the document content is much longer, short-long text matching is needed in this scenario. Due to the difficulty of topic inference on short text, we compute the probability of the short text being generated from the topic distribution of the long text as follows:
$$P(q \mid c) = \prod_{w \in q} \sum_{k=1}^{K} P(w \mid z_k)\, P(z_k \mid c) \tag{3}$$
where $q$ stands for the query, $c$ for the document content, $w$ for the words in the query and $z_k$ for the $k$-th of the $K$ topics.
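The computation can be sketched as follows, assuming the usual form in which each query word's probability is a mixture of topic-word probabilities weighted by the document's topic distribution, and the per-word probabilities are multiplied together. The function name, the two-topic model and the page distributions are made up for illustration.

```python
def query_likelihood(query_words, doc_topics, topic_word):
    """Probability of the query words under the document's topic mixture."""
    likelihood = 1.0
    for w in query_words:
        # Mix the per-topic word probabilities by the document's topic weights.
        likelihood *= sum(
            topic_word[k].get(w, 0.0) * p_k for k, p_k in enumerate(doc_topics)
        )
    return likelihood


# Invented two-topic model: topic 0 is about weddings, topic 1 about finance.
topic_word = [
    {"wedding": 0.5, "ceremony": 0.4, "stock": 0.1},
    {"wedding": 0.05, "ceremony": 0.05, "stock": 0.9},
]
wedding_page = [0.9, 0.1]   # topic distribution of a hypothetical wedding page
finance_page = [0.1, 0.9]   # topic distribution of a hypothetical finance page
query = ["wedding", "ceremony"]

wedding_score = query_likelihood(query, wedding_page, topic_word)
finance_score = query_likelihood(query, finance_page, topic_word)
```

For long queries the product underflows quickly, so summing log-probabilities is a common alternative.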
We first discuss the task of online advertising, in which we need to compute the semantic similarity between a query and ad pages. We treat each textual field on an ad page as a sentence (shown in Figure 3) and apply SentenceLDA for this task. After obtaining the topic distribution of each ad page, we apply Eq.(3) to compute the semantic similarity between the query and the ad page. This similarity can be utilized as a feature in downstream ranking models. For the query “recording of wedding ceremony”, we compare the ranking results from two strategies in Figure 4. The result with the SentenceLDA feature better satisfies the underlying need of the query.
An extreme case of short-long text matching is the task of keyword extraction from documents. We extract a set of keywords from a document as a concise and explicit representation of that document. The conventional way of extracting keywords from texts relies upon TF and IDF information. If we want to introduce semantic importance, we can use Eq.(4) to compute the similarity between a word and the document as follows:
$$\mathrm{Similarity}(w, c) = \sum_{k=1}^{K} \cos(\mathbf{v}_w, \mathbf{z}_k)\, P(z_k \mid c) \tag{4}$$
where $c$ stands for the document content, $w$ for each word, $\mathbf{v}_w$ for the word embedding of $w$ and $\mathbf{z}_k$ for the vector representation of topic $z_k$. Figure 5 is a news article. We use Eq.(4) to compute the similarity between each word and the whole article. The top-10 keywords (with stop words eliminated) extracted by TWE are shown in Figure 6, and we can see that the keywords from TWE preserve the important information in the news.
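The scoring can be sketched as below, under the assumption that a word's score is the cosine similarity between its embedding and each topic embedding, weighted by the document's topic probabilities. The 2-dimensional vectors and the function names are hypothetical; real TWE embeddings live in a much higher-dimensional space.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0


def keyword_scores(candidates, word_vecs, topic_vecs, doc_topics):
    """Score each candidate word against the document's topic mixture."""
    return {
        w: sum(p_k * cosine(word_vecs[w], topic_vecs[k])
               for k, p_k in enumerate(doc_topics))
        for w in candidates
    }


def top_keywords(scores, n):
    return sorted(scores, key=scores.get, reverse=True)[:n]


# Invented embeddings: topic 0 points along the first axis, topic 1 along the second.
topic_vecs = [[1.0, 0.0], [0.0, 1.0]]
word_vecs = {"market": [1.0, 0.0], "design": [0.0, 1.0], "shares": [0.6, 0.8]}
doc_topics = [0.8, 0.2]   # the document is mostly about topic 0

scores = keyword_scores(["market", "design", "shares"], word_vecs, topic_vecs, doc_topics)
best = top_keywords(scores, 2)
```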
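We can evaluate the semantic similarity between long texts by the distance between their topical distributions. Such semantic similarity can be further utilized as a feature in various machine learning models. The distance metrics for gauging two topical distributions include Hellinger Distance (HD) and Jensen-Shannon Divergence (JSD). Hellinger Distance is formally defined as follows:
$$HD(P, Q) = \frac{1}{\sqrt{2}} \sqrt{\sum_{i=1}^{K} \left(\sqrt{p_i} - \sqrt{q_i}\right)^2}$$
where $p_i$ and $q_i$ are the $i$-th elements of the corresponding $K$-dimensional distributions $P$ and $Q$. The definition of Jensen-Shannon Divergence (JSD) is as follows:
$$JSD(P \,\|\, Q) = \frac{1}{2}\left(KL(P \,\|\, M) + KL(Q \,\|\, M)\right), \quad M = \frac{1}{2}(P + Q)$$
where $KL(P \,\|\, M) = \sum_{i=1}^{K} p_i \log \frac{p_i}{m_i}$ stands for Kullback-Leibler Divergence.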
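Both metrics are short to implement. The sketch below uses the standard textbook definitions (Hellinger distance normalized to [0, 1], JSD with the natural logarithm and the midpoint distribution); the function names are illustrative and not taken from Familia.

```python
import math


def hellinger_distance(p, q):
    """Hellinger distance between two discrete distributions, in [0, 1]."""
    total = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(total) / math.sqrt(2.0)


def kl_divergence(p, q):
    """KL(p || q); terms with p_i = 0 contribute nothing."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0.0)


def jensen_shannon_divergence(p, q):
    """JSD via the midpoint distribution m = (p + q) / 2."""
    m = [(a + b) / 2.0 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


doc_a = [0.7, 0.2, 0.1]
doc_b = [0.1, 0.3, 0.6]
hd = hellinger_distance(doc_a, doc_b)
jsd = jensen_shannon_divergence(doc_a, doc_b)
```

Both measures are symmetric and equal zero for identical distributions; with the natural logarithm JSD is bounded above by log 2.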
We now discuss the task of personalized news recommendation, which is illustrated in Figure 7. We first collect the news articles (or news titles) recently read by each user and compose them into a pseudo document. By conducting topic modeling on a corpus of pseudo documents, we obtain the topic distribution of each pseudo document, and this distribution serves as the corresponding user profile. In the online setting, we compute the Hellinger Distance (HD) between the topic distribution of each real-time news article and the user profile, and the articles with low HD values are pushed to the user as a personalized news feed. This approach has effectively improved the performance of a real-world news feed application.
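The online ranking step might look like the sketch below: candidate articles are sorted by their Hellinger Distance to the user profile and the closest ones are returned. The article identifiers, the distributions and the `recommend` function are hypothetical, and the sketch omits how the profile and article distributions are inferred.

```python
import math


def hellinger_distance(p, q):
    total = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(total) / math.sqrt(2.0)


def recommend(user_profile, candidates, top_n):
    """Return the ids of the top_n articles closest to the user profile."""
    ranked = sorted(
        candidates, key=lambda item: hellinger_distance(user_profile, item[1])
    )
    return [article_id for article_id, _ in ranked[:top_n]]


# Topic distribution of the user's pseudo document (recently read articles)
user_profile = [0.7, 0.2, 0.1]
# Hypothetical real-time articles with their topic distributions
candidates = [
    ("finance-01", [0.1, 0.1, 0.8]),
    ("sports-07", [0.6, 0.3, 0.1]),
    ("tech-03", [0.3, 0.4, 0.3]),
]

feed = recommend(user_profile, candidates, top_n=2)
```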
We proceed to discuss the task of personalized fiction recommendation. Matrix factorization is a common approach for industrial recommendation systems. SVDFeature (Chen et al., 2012) is a framework designed to efficiently solve feature-based matrix factorization. SVDFeature is quite flexible and is able to accommodate global features, user features and item features. SVDFeature can be mathematically described as follows:
$$y = \mu + \left(\sum_{j} b^{(g)}_j \gamma_j + \sum_{j} b^{(u)}_j \alpha_j + \sum_{j} b^{(i)}_j \beta_j\right) + \left(\sum_{j} p_j \alpha_j\right)^{T} \left(\sum_{j} q_j \beta_j\right)$$
where $y$ is the target, $\mu$ is a constant indicating the global mean value of the target, $\alpha$ represents the user features, $\beta$ represents the item features, $\gamma$ represents the global features, $b^{(g)}$ is the weight of the global features, $b^{(u)}$ is the weight of the user features, $b^{(i)}$ is the weight of the item features, and $p$ and $q$ are model parameters.
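To make the prediction rule concrete, the sketch below evaluates a feature-based matrix factorization score, assuming the usual SVDFeature form: global mean, plus linear bias terms for global, user and item features, plus the inner product of the feature-weighted user and item latent vectors. All names and numbers are invented for illustration, and training is out of scope.

```python
def svdfeature_predict(mu, global_feats, user_feats, item_feats,
                       global_bias, user_bias, item_bias,
                       user_factors, item_factors):
    """Prediction of feature-based matrix factorization (assumed SVDFeature form)."""
    # Linear part: each feature value times its learned bias weight.
    linear = (
        sum(g * b for g, b in zip(global_feats, global_bias))
        + sum(a * b for a, b in zip(user_feats, user_bias))
        + sum(c * b for c, b in zip(item_feats, item_bias))
    )
    # Latent part: feature-weighted sums of the factor vectors, then a dot product.
    dim = len(user_factors[0])
    user_vec = [sum(a * p[d] for a, p in zip(user_feats, user_factors)) for d in range(dim)]
    item_vec = [sum(c * q[d] for c, q in zip(item_feats, item_factors)) for d in range(dim)]
    interaction = sum(u * v for u, v in zip(user_vec, item_vec))
    return mu + linear + interaction


prediction = svdfeature_predict(
    mu=3.0,
    global_feats=[0.5],            # e.g. one global feature value
    user_feats=[1.0, 0.0],         # one-hot user indicator
    item_feats=[0.0, 1.0],         # one-hot item indicator
    global_bias=[0.2],
    user_bias=[0.1, -0.3],
    item_bias=[0.4, 0.2],
    user_factors=[[1.0, 2.0], [0.0, 1.0]],
    item_factors=[[3.0, 0.0], [0.5, 0.25]],
)
```

With one-hot user and item features and no global features, this reduces to ordinary biased matrix factorization.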
In the scenario of personalized fiction recommendation, each user has some historically downloaded fictions. By conducting topic modeling on these fictions, we can obtain the user’s topic representation, which serves as a user profile of reading interests. By computing the JSD between the topic distribution of each fiction and the user profile, we can quantify how likely the user is to be interested in this fiction. We augment the aforementioned SVDFeature framework with the JSD value as a global feature (Figure 8). From the comparative study shown in Figure 9, we can see that adding JSD effectively improves the performance of SVDFeature. SVDFeature with JSD consistently outperforms its original counterpart in terms of both Precision and NDCG.
Familia is an open-source toolkit for industrial topic modeling. It supports two industrial utility paradigms: semantic representation and semantic matching. Three topic models of high industrial value have been made publicly available as well. These topic models are all trained on large-scale industrial corpora across various domains. We present several industrial cases where topic modeling is successfully applied. The discussion of these cases serves as a guide for developers to select and use topic models in their own tasks. We hope Familia will help engineers employ topic modeling in a more convenient way and inspire more pragmatic research on topic models.
Hofmann, T. (1999). Probabilistic latent semantic analysis. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers Inc., pages 289–296.