Familia: An Open-Source Toolkit for Industrial Topic Modeling

07/31/2017 ∙ by Di Jiang, et al. ∙ Baidu, Inc.

Familia is an open-source toolkit for pragmatic topic modeling in industry. Familia abstracts the utilities of topic modeling in industry as two paradigms: semantic representation and semantic matching. Efficient implementations of the two paradigms are made publicly available for the first time. Furthermore, we provide off-the-shelf topic models trained on large-scale industrial corpora, including Latent Dirichlet Allocation (LDA), SentenceLDA and Topical Word Embedding (TWE). We further describe typical applications which are successfully powered by topic modeling, in order to ease the confusions and difficulties of software engineers during topic model selection and utilization.




1 Introduction

Topic modeling is a well-recognized approach for organizing, searching and understanding vast amounts of documents Blei (2012). In the last decade, various topic models have been proposed in academia. However, only a very small portion of these models have been applied in industry. The major problems that hinder the wide adoption of topic modeling in industry are essentially twofold. First, many research and engineering endeavors are devoted to Probabilistic Latent Semantic Analysis (PLSA) Hofmann (1999) and its fully Bayesian counterpart Latent Dirichlet Allocation (LDA) Blei et al. (2003), while the vast majority of other topic models lack open-source implementations. Second, most existing research and engineering work focuses on “designing new topic models” rather than “applying topic modeling results in real-life applications”. As a result, there is little work discussing how to properly utilize topic modeling results in real-life scenarios. Given the absence of a comprehensive guide, it is often difficult for developers to select an appropriate topic model for their tasks and to apply the model in a proper way.

As one step toward bridging the gap between topic modeling research and industrial applications, we open-source a topic modeling toolkit named Familia (https://github.com/baidu/Familia). Three off-the-shelf topic models are readily provided in Familia: LDA, SentenceLDA Jo and Oh (2011) and TWE Liu et al. (2015), all trained on large-scale industrial corpora. Developers thus have the freedom to explore topic models beyond LDA. For effective usage in industrial applications, efficient implementations of two utility paradigms (i.e., semantic representation and semantic matching) have been made publicly available in Familia. We further discuss several industrial cases that benefit from topic modeling, aiming to ease the confusion and difficulties that developers face during topic model selection and application. The rest of this paper is organized as follows. We detail the open-sourced content of Familia in Section 2. Then we discuss several industrial cases in Section 3. Finally, we conclude the paper in Section 4.

2 Familia Overview

Familia consists of two major components: source code for two industrial utility paradigms and the trained topic models.

2.1 Industrial Utility Paradigms

The industrial utilities of topic models can be broadly divided into two categories: semantic representation and semantic matching. For semantic representation, Familia provides two kinds of Markov chain Monte Carlo (MCMC) algorithms for users to investigate and choose: Gibbs sampling Griffiths and Steyvers (2004) and Metropolis-Hastings Yuan et al. (2015). For semantic matching, Familia contains functions for calculating the semantic similarity between texts of different lengths: short-long text matching and long-long text matching. Moreover, in order to better display the details of trained topic models, Familia also provides functions such as nearest-word querying and topic-word querying.
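To make the semantic representation step more concrete, the following is a minimal sketch of Gibbs sampling for inferring the topic distribution of a single document when the topic-word distributions of a trained LDA model are held fixed. It is not Familia's implementation: the function name `infer_topics_gibbs`, the parameters `alpha`, `sweeps` and `seed`, and the toy model are all assumptions made for illustration.

```python
import random


def infer_topics_gibbs(doc, phi, alpha=0.1, sweeps=50, seed=0):
    """Estimate a document's topic distribution with Gibbs sampling.

    doc   : list of word ids
    phi   : phi[k][w] = P(word w | topic k), taken from a trained model and held fixed
    alpha : symmetric Dirichlet prior on the document-topic distribution (assumed value)
    """
    rng = random.Random(seed)
    num_topics = len(phi)

    # Start from a random topic assignment for every token.
    z = [rng.randrange(num_topics) for _ in doc]
    counts = [0] * num_topics
    for k in z:
        counts[k] += 1

    for _ in range(sweeps):
        for pos, w in enumerate(doc):
            counts[z[pos]] -= 1  # take the current token out of the counts
            # P(z = k | other assignments) is proportional to (n_dk + alpha) * phi[k][w].
            weights = [(counts[k] + alpha) * phi[k][w] for k in range(num_topics)]
            z[pos] = rng.choices(range(num_topics), weights=weights)[0]
            counts[z[pos]] += 1

    # Smoothed topic proportions from the final assignment.
    total = len(doc) + num_topics * alpha
    return [(counts[k] + alpha) / total for k in range(num_topics)]


# Toy model with two topics over a four-word vocabulary (invented numbers).
toy_phi = [[0.49, 0.49, 0.01, 0.01],
           [0.01, 0.01, 0.49, 0.49]]
theta = infer_topics_gibbs([0, 1, 0, 1, 0, 0, 1, 1], toy_phi)
```

The sketch reads the distribution off the final sample only; averaging over several late sweeps would usually give a less noisy estimate.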

2.2 Trained Topic Models

Three off-the-shelf topic models of high industrial value are publicly released in Familia. Each of them is trained on large-scale industrial corpora. The characteristics of the three topic models are briefly summarized as follows:

  • LDA: Each document is represented as a mixture over latent topics and each topic is modeled as a distribution of words.

  • SentenceLDA: SentenceLDA assumes that the words within one sentence are generated by the same topic. It models the co-occurrence of words in finer granularity than LDA.

  • TWE: TWE utilizes LDA topics as complementary information for training word embeddings. Hence, TWE can provide both topic embeddings and word embeddings. As the topics derived by LDA are often dominated by high-frequency words, the embeddings of TWE can partially alleviate this problem by capturing the semantics of low-frequency words under each topic.

3 Industrial Cases

In this section, we discuss several industrial cases that benefit from the aforementioned paradigms and models. This section serves as a guide for developers to select appropriate topic models for their tasks and apply them in a proper way.

3.1 Semantic Representation

We first discuss some cases involving semantic representation. The semantic representation derived by topic modeling typically works as features for other machine learning models.

3.1.1 Document Classification

(a) News Topics Distribution as Augmented Features of GBDT
(b) Experimental Results of News Classification
Figure 1: Classification of News Articles

The first case is the classification of news articles. In a news feed service, the articles collected from various sources often include low-quality ones. In order to improve user experience, we need to design a classifier to distinguish the good ones from the bad ones. Conventionally, the classifier is built upon handcrafted features such as source site, text length and the total number of images. We can employ a topic model to obtain the topical distribution of each article and augment the handcrafted features with this distribution (shown in Figure 1(a)). As an experiment, we prepare 7,000 news articles, which are manually labeled into 5 categories, in which 0 stands for the lowest quality and 4 represents the best. We train Gradient Boosting Decision Tree (GBDT) classifiers on 5,000 articles with different feature sets and test them on the other 2,000 articles. Figure 1(b) shows the results of two classifiers using different sets of features: baseline and baseline+LDA. The results obtained with topic model features are significantly better, showing that topic modeling is an effective way of representing documents.
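The feature augmentation step can be sketched as a simple concatenation. The helper name `augment_features` and the example values are hypothetical; the resulting vectors would then be passed to whichever GBDT implementation is in use.

```python
def augment_features(handcrafted, topic_distribution):
    """Append a document's topic distribution to its handcrafted features."""
    if abs(sum(topic_distribution) - 1.0) > 1e-6:
        raise ValueError("topic distribution should sum to 1")
    return list(handcrafted) + list(topic_distribution)


# Hypothetical article: [text length, number of images] plus a 3-topic distribution.
features = augment_features([1200.0, 4.0], [0.7, 0.2, 0.1])
```

Because tree-based models are insensitive to feature scaling, the raw topic probabilities can be appended without normalizing them against the handcrafted features.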

3.1.2 Document Clustering

The semantic representation of documents can also be used directly for clustering. In the task of clustering news articles, we use LDA to compute the topic distribution of each article and cluster the articles by K-means. Figure 2 shows two of the clusters obtained by clustering 1,000 articles into ten groups. Cluster1 consists of articles related to interior design and Cluster2 contains articles about the stock market. The result shows that news articles can be semantically clustered based on their topic distributions.
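As an illustration of clustering on topic distributions, here is a minimal K-means (Lloyd's algorithm) sketch in plain Python. Taking the first k points as initial centroids and using squared Euclidean distance are assumptions made to keep the example deterministic; the toy distributions are invented.

```python
def kmeans(points, k, iterations=20):
    """Plain Lloyd's K-means; the first k points serve as initial centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical 2-topic distributions for four articles.
docs = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.2], [0.2, 0.8]]
labels, centroids = kmeans(docs, k=2)
```

In practice a randomized or k-means++ style initialization with several restarts is the more common choice.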

Figure 2: Example of Clustering News Articles

3.1.3 Document Information Richness

In text mining, we frequently encounter the need to evaluate the information richness of documents. This requirement can be partially met by the topic distributions of documents. We first calculate the topical distribution of the document by topic modeling, and then calculate the entropy of the topical distribution as follows:

$$H(d) = -\sum_{i=1}^{K} p_i \log p_i \tag{1}$$

where $d$ represents the document, $K$ represents the total number of topics, and $p_i$ represents the probability of the $i$-th topic. The higher the entropy, the higher the richness of the document content. In the scenario of web information retrieval, the richness value is utilized as a feature in sophisticated ranking models.
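The entropy-based richness score is short enough to sketch directly. The function name `topic_entropy` is an assumption, the natural logarithm is used, and zero-probability topics are taken to contribute nothing.

```python
import math


def topic_entropy(topic_distribution):
    """Shannon entropy (natural log) of a topic distribution."""
    # Terms with p = 0 are skipped, following the convention 0 * log 0 = 0.
    return -sum(p * math.log(p) for p in topic_distribution if p > 0)


focused = topic_entropy([0.97, 0.01, 0.01, 0.01])  # one dominant topic
broad = topic_entropy([0.25, 0.25, 0.25, 0.25])    # evenly spread over topics
```

A document spread evenly over $K$ topics reaches the maximum value $\log K$, while a document concentrated on a single topic scores close to zero.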

3.2 Semantic Matching

Another paradigm is semantic matching, which can be further categorized as short-short text matching, short-long text matching and long-long text matching.

3.2.1 Short-Short Text Matching

The need for short-short text matching is common in web search, where we need to compute the semantic similarity between queries and web page titles. Due to the difficulty of topic modeling on short text, embedding-based models such as Word2Vec and TWE are much more common for this task. Assume we want to compute the semantic similarity between a query “recommend good movies” and a web page title “2016 good movies in China”. We first convert the query and the title into their embeddings (i.e., $\vec{q}$ and $\vec{t}$) and then compute the semantic similarity between these embeddings with the metric of cosine similarity:

$$\mathrm{Similarity}(q, t) = \cos(\vec{q}, \vec{t}) = \frac{\vec{q} \cdot \vec{t}}{\|\vec{q}\| \, \|\vec{t}\|} \tag{2}$$
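A minimal sketch of this computation follows. The text does not say how a short text is turned into a single embedding, so averaging the word vectors is an assumption here, and the two-dimensional toy embeddings are invented purely for illustration.

```python
import math


def average_embedding(words, embeddings):
    """Average the vectors of the words found in the embedding table (an assumption)."""
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        raise ValueError("no known words in text")
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy embeddings, not taken from any trained model.
toy = {
    "recommend": [1.0, 1.0],
    "good": [1.0, 0.0],
    "movies": [0.0, 1.0],
    "china": [0.0, 1.0],
}
q = average_embedding(["recommend", "good", "movies"], toy)
t = average_embedding(["2016", "good", "movies", "in", "china"], toy)
score = cosine_similarity(q, t)
```

Words missing from the embedding table (here “2016” and “in”) are simply skipped, which is one of several reasonable ways to handle out-of-vocabulary tokens.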


There are more sophisticated short-short text matching mechanisms in the literature. Interested readers may refer to deep neural network based models such as the Deep Structured Semantic Model (DSSM) Huang et al. (2013) and the Convolutional Latent Semantic Model (CLSM) Shen et al. (2014).

3.2.2 Short-Long Text Matching

In many online applications, we need to compute the semantic similarity between a query and a document. Since the query is typically short and the document content is much longer, short-long text matching is needed in this scenario. Due to the difficulty of topic inference on short text, we compute the probability that the short text is generated from the topic distribution of the long text as follows:

$$P(q \mid c) = \prod_{w \in q} \sum_{k=1}^{K} P(w \mid z_k) \, P(z_k \mid c) \tag{3}$$

where $q$ stands for the query, $c$ for the document content, $w$ for the words in the query and $z_k$ for the topics.
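The generation probability of a query given a document's topic distribution can be sketched as below, assuming it is the product over query words of the topic-weighted word probabilities. The function name, the toy topic-word tables and the small probability floor for unseen words are assumptions. The computation is carried out in log space because a product of many small probabilities underflows quickly.

```python
import math


def query_log_likelihood(query_words, doc_topics, topic_word):
    """Return log P(q | c) = sum over w in q of log sum_k P(w | z_k) * P(z_k | c)."""
    log_prob = 0.0
    for w in query_words:
        p_w = sum(
            p_topic * topic_word[k].get(w, 0.0)
            for k, p_topic in enumerate(doc_topics)
        )
        # A tiny floor (an assumption) avoids log(0) for words no topic has seen.
        log_prob += math.log(max(p_w, 1e-12))
    return log_prob


# Toy model: P(word | topic) for two topics (invented numbers).
toy_topic_word = [
    {"wedding": 0.5, "video": 0.5},
    {"stock": 0.6, "market": 0.4},
]
query = ["wedding", "video"]
score_a = query_log_likelihood(query, [0.9, 0.1], toy_topic_word)  # wedding-heavy page
score_b = query_log_likelihood(query, [0.1, 0.9], toy_topic_word)  # finance-heavy page
```

Since longer queries always receive lower likelihoods, such scores are best compared across documents for the same query rather than across queries.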

Figure 3: Example of Ad Page
(a) Baseline
(b) Result with SentenceLDA Feature
Figure 4: Semantic Matching of Query-Ad

We first discuss the task of online advertising, in which we need to compute the semantic similarity between a query and ad pages. We treat each textual field on an ad page as a sentence (shown in Figure 3) and apply SentenceLDA for this task. After obtaining the topic distribution of each ad page, we apply Eq.(3) to compute the semantic similarity between the query and the ad page. Such similarity can be utilized as a feature in downstream ranking models. For the query “recording of wedding ceremony”, we compare the ranking results of two strategies in Figure 4. We can see that the result with the SentenceLDA feature better satisfies the underlying need of the query.

An extreme case of short-long text matching is keyword extraction, in which we extract a set of keywords as a concise and explicit representation of the document. The conventional way of extracting keywords from texts relies upon TF and IDF information. To take semantic importance into account, we can use Eq.(4) to compute the similarity between a word and the document:

$$\mathrm{Similarity}(w, c) = \sum_{k=1}^{K} \cos(\vec{v}_w, \vec{z}_k) \, P(z_k \mid c) \tag{4}$$

where $c$ stands for the document content, $w$ for each word, $\vec{v}_w$ for the word embedding of $w$ and $\vec{z}_k$ for the vector representation of topic $z_k$. Figure 5 shows a news article. We use Eq.(4) to compute the similarity between each word and the whole article. The top-10 keywords (with stop words eliminated) extracted by TWE are shown in Figure 6, and we can see that the keywords from TWE preserve the important information in the news.
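The keyword scoring can be sketched as follows, under the assumption that Eq.(4) weights the cosine similarity between a word embedding and each topic embedding by that topic's probability in the document. The function names and the toy vectors are hypothetical and do not come from a trained TWE model.

```python
import math


def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def keyword_scores(words, word_vecs, topic_vecs, doc_topics):
    """Score each word by sum_k cos(v_w, z_k) * P(z_k | c)."""
    return {
        w: sum(p * cosine(word_vecs[w], topic_vecs[k]) for k, p in enumerate(doc_topics))
        for w in words
    }


def top_keywords(scores, n):
    """Return the n highest-scoring words, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:n]


# Toy word and topic embeddings that share one vector space (invented numbers).
toy_word_vecs = {"bank": [1.0, 0.0], "river": [0.0, 1.0], "loan": [0.6, 0.4]}
toy_topic_vecs = [[1.0, 0.0], [0.0, 1.0]]
scores = keyword_scores(["bank", "river", "loan"], toy_word_vecs, toy_topic_vecs, [0.8, 0.2])
best = top_keywords(scores, 2)
```

Stop-word removal, which the text applies before ranking, is left out of the sketch.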

Figure 5: Example of News Article
Figure 6: Keyword Extraction based on TWE

3.2.3 Long-Long Text Matching

We can evaluate the semantic similarity between long texts by the distance between their topical distributions. Such semantic similarity can be further utilized as a feature in various machine learning models. The distance metrics for comparing two topical distributions include Hellinger Distance (HD) and Jensen-Shannon Divergence (JSD). Hellinger Distance is formally defined as follows:

$$HD(P, Q) = \frac{1}{\sqrt{2}} \sqrt{\sum_{i=1}^{K} \left( \sqrt{p_i} - \sqrt{q_i} \right)^2} \tag{5}$$

where $p_i$ and $q_i$ are the $i$-th elements of the corresponding distributions. The definition of Jensen-Shannon Divergence (JSD) is as follows:

$$JSD(P \,\|\, Q) = \frac{1}{2} KL(P \,\|\, M) + \frac{1}{2} KL(Q \,\|\, M), \quad M = \frac{1}{2}(P + Q) \tag{6}$$

where $KL$ stands for Kullback-Leibler Divergence.
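Both metrics are easy to compute on topic distributions represented as plain lists. The sketch below uses the standard definitions of Hellinger Distance and Jensen-Shannon Divergence with the natural logarithm; the function names are assumptions.

```python
import math


def hellinger_distance(p, q):
    """Hellinger Distance between two discrete distributions, in [0, 1]."""
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared) / math.sqrt(2)


def kl_divergence(p, q):
    """Kullback-Leibler Divergence KL(p || q); terms with p_i = 0 contribute 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def jensen_shannon_divergence(p, q):
    """JSD(p || q) with the mixture m = (p + q) / 2; bounded by log 2."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


same = hellinger_distance([0.5, 0.5], [0.5, 0.5])
disjoint_hd = hellinger_distance([1.0, 0.0], [0.0, 1.0])
disjoint_jsd = jensen_shannon_divergence([1.0, 0.0], [0.0, 1.0])
```

Unlike plain KL divergence, both metrics are symmetric and stay finite when one distribution assigns zero probability to a topic, which makes them convenient for sparse topic distributions.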

We now discuss the task of personalized news recommendation, which is illustrated in Figure 7. We first collect the news articles (or news titles) recently read by each user and compose them into a pseudo document. By conducting topic modeling on a corpus of pseudo documents, we obtain the topic distribution of each pseudo document, and this distribution works as the corresponding user profile. In the online setting, we compute the Hellinger Distance (HD) between the topic distribution of each real-time news article and the user profile, and the articles with low HD values are pushed to the user as a personalized news feed. This approach has effectively improved the performance of a real-world news feed application.
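The online ranking step can be sketched as sorting candidate articles by their Hellinger Distance to the user profile. The function `recommend`, the article ids and the three-topic distributions are hypothetical.

```python
import math


def hellinger_distance(p, q):
    """Hellinger Distance between two discrete distributions, in [0, 1]."""
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared) / math.sqrt(2)


def recommend(user_profile, candidates, top_n):
    """Return the ids of the top_n candidates closest to the profile (lowest HD first)."""
    ranked = sorted(
        candidates,
        key=lambda article_id: hellinger_distance(user_profile, candidates[article_id]),
    )
    return ranked[:top_n]


# Hypothetical user profile and candidate articles over three topics.
profile = [0.7, 0.2, 0.1]
articles = {
    "sports": [0.8, 0.1, 0.1],
    "finance": [0.1, 0.8, 0.1],
    "mixed": [0.5, 0.3, 0.2],
}
feed = recommend(profile, articles, top_n=2)
```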

Figure 7: Personalized News Recommendation
Figure 8: SVDFeature with Topic Feature

We proceed to discuss the task of personalized fiction recommendation. Matrix factorization is a common approach for industrial recommendation systems. SVDFeature Chen et al. (2012) is a framework designed to efficiently solve feature-based matrix factorization. SVDFeature is quite flexible and is able to accommodate global features, user features and item features. SVDFeature can be mathematically described as follows:

$$y = \mu + \left( \sum_{j} b^{(g)}_j \gamma_j + \sum_{j} b^{(u)}_j \alpha_j + \sum_{j} b^{(i)}_j \beta_j \right) + \left( \sum_{j} p_j \alpha_j \right)^{T} \left( \sum_{j} q_j \beta_j \right) \tag{7}$$

where $y$ is the target, $\mu$ is a constant indicating the global mean value of the target, $\alpha$ represents the user features, $\beta$ represents the item features, $\gamma$ represents the global features, $b^{(g)}$ is the weight of the global features, $b^{(u)}$ is the weight of the user features, $b^{(i)}$ is the weight of the item features, and $p_j$ and $q_j$ are model parameters.
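The feature-based matrix factorization prediction can be sketched as below, assuming the usual form of a global mean, three linear bias terms and an inner product of feature-weighted latent factors. This is not the SVDFeature toolkit's API: the function name, the sparse-dict feature format and all numeric values are assumptions for illustration.

```python
def svdfeature_predict(mu, global_feats, user_feats, item_feats,
                       b_global, b_user, b_item, p, q):
    """Predict a target with feature-based matrix factorization.

    Features are sparse dicts {feature index: value}. p[j] and q[j] are the
    latent factor vectors attached to user feature j and item feature j.
    """
    # Linear part: weighted global, user and item features.
    bias = (sum(b_global[j] * v for j, v in global_feats.items())
            + sum(b_user[j] * v for j, v in user_feats.items())
            + sum(b_item[j] * v for j, v in item_feats.items()))
    # Factor part: inner product of the aggregated user and item factors.
    dim = len(next(iter(p.values())))
    user_vec = [sum(p[j][d] * v for j, v in user_feats.items()) for d in range(dim)]
    item_vec = [sum(q[j][d] * v for j, v in item_feats.items()) for d in range(dim)]
    return mu + bias + sum(a * b for a, b in zip(user_vec, item_vec))


# Toy example: one global feature (e.g. a JSD value), one user, one item.
prediction = svdfeature_predict(
    mu=3.0,
    global_feats={0: 0.2},
    user_feats={0: 1.0},
    item_feats={0: 1.0},
    b_global={0: -1.0},
    b_user={0: 0.5},
    b_item={0: -0.25},
    p={0: [1.0, 2.0]},
    q={0: [0.5, 0.25]},
)
```

With a negative weight on the global feature, as in this toy setup, a larger divergence between a fiction and the user profile lowers the predicted score.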

In the scenario of personalized fiction recommendation, each user has some historically downloaded fictions. By conducting topic modeling on these fictions, we can obtain the user’s topic representation, which works as a user profile of reading interests. By computing the JSD between the topic distribution of each fiction and the user profile, we can quantify how likely the user is to be interested in this fiction. We augment the aforementioned SVDFeature framework with the JSD value as a global feature (Figure 8). From the comparative study shown in Figure 9, we can see that adding JSD effectively improves the performance of SVDFeature. SVDFeature with JSD consistently outperforms its original counterpart in terms of both Precision and NDCG.

Figure 9: Fiction Recommendation Performance

4 Conclusion

Familia is an open-source toolkit for industrial topic modeling. It supports two industrial utility paradigms: semantic representation and semantic matching. Three topic models of high industrial value have been made publicly available as well. These topic models are all trained on large-scale industrial corpora across various domains. We present several industrial cases where topic modeling is successfully applied. The discussion of these cases works as a guide for developers to conduct topic model selection and utilization in their own tasks. We hope Familia can help engineers employ topic modeling in a more convenient way and inspire more pragmatic research on topic models.


References

  • Blei (2012) David M Blei. 2012. Probabilistic topic models. Communications of the ACM 55(4):77–84.
  • Blei et al. (2003) David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent dirichlet allocation. Journal of Machine Learning Research 3(Jan):993–1022.
  • Chen et al. (2012) Tianqi Chen, Weinan Zhang, Qiuxia Lu, Kailong Chen, Zhao Zheng, and Yong Yu. 2012. Svdfeature: a toolkit for feature-based collaborative filtering. Journal of Machine Learning Research 13(Dec):3619–3622.
  • Griffiths and Steyvers (2004) Thomas L Griffiths and Mark Steyvers. 2004. Finding scientific topics. Proceedings of the National Academy of Sciences 101(suppl 1):5228–5235.
  • Hofmann (1999) Thomas Hofmann. 1999. Probabilistic latent semantic analysis. In Proceedings of the Fifteenth conference on Uncertainty in artificial intelligence. Morgan Kaufmann Publishers Inc., pages 289–296.
  • Huang et al. (2013) Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM international conference on Conference on information & knowledge management. ACM, pages 2333–2338.
  • Jo and Oh (2011) Yohan Jo and Alice H Oh. 2011. Aspect and sentiment unification model for online review analysis. In Proceedings of the fourth ACM international conference on Web search and data mining. ACM, pages 815–824.
  • Liu et al. (2015) Yang Liu, Zhiyuan Liu, Tat-Seng Chua, and Maosong Sun. 2015. Topical word embeddings. In AAAI. pages 2418–2424.
  • Shen et al. (2014) Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Grégoire Mesnil. 2014. A latent semantic model with convolutional-pooling structure for information retrieval. In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management. ACM, pages 101–110.
  • Yuan et al. (2015) Jinhui Yuan, Fei Gao, Qirong Ho, Wei Dai, Jinliang Wei, Xun Zheng, Eric Po Xing, Tie-Yan Liu, and Wei-Ying Ma. 2015. Lightlda: Big topic models on modest computer clusters. In Proceedings of the 24th International Conference on World Wide Web. ACM, pages 1351–1361.