Familia: A Configurable Topic Modeling Framework for Industrial Text Engineering

08/11/2018
by Di Jiang, et al.

In the last decade, a variety of topic models have been proposed for text engineering. However, except for Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA), most existing topic models are seldom applied or even considered in industrial scenarios. The cause is that very few convenient tools support these topic models so far. Intimidated by the expertise and labor required to design and implement parameter inference algorithms, software engineers tend to simply resort to PLSA/LDA without considering whether it suits the problem at hand. In this paper, we propose a configurable topic modeling framework named Familia to bridge the large gap between academic research results and current industrial practice. Familia supports an important line of topic models that are widely applicable in text engineering scenarios. To relieve the burden on software engineers who lack knowledge of Bayesian networks, Familia conducts automatic parameter inference for a variety of topic models. Simply by changing the data organization of Familia, software engineers can explore a broad spectrum of existing topic models, or even design their own, and find the one that best suits the problem at hand. In addition to its superior extensibility, Familia has a novel sampling mechanism that strikes a balance between the effectiveness and efficiency of parameter inference. Furthermore, Familia is essentially a big topic modeling framework that supports parallel parameter inference and distributed parameter storage. The utility and necessity of Familia are demonstrated in real-life industrial applications. Familia would significantly enlarge software engineers' arsenal of topic models and pave the way for using highly customized topic models in real-life problems.
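
To give a sense of the hand-written parameter inference that the abstract says engineers would otherwise face, the following is a minimal sketch of a textbook collapsed Gibbs sampler for plain LDA. It is not Familia's sampling mechanism or API, which the abstract does not describe. The function name `lda_gibbs`, its parameters, and the toy corpus are all hypothetical and chosen only for illustration.

```python
import random


def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, iters=50, seed=0):
    """Textbook collapsed Gibbs sampling for LDA (illustrative, not Familia's method).

    docs is a list of documents, each a list of integer word ids in [0, vocab_size).
    Returns the document-topic and topic-word count tables.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Random initialisation: give every token a topic and record the counts.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_d.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_d)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1

                # Unnormalised conditional p(z = k | all other assignments).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]

                # Resample the topic and add the token back to the counts.
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    return doc_topic, topic_word


# Toy corpus (hypothetical): four short documents over a six-word vocabulary.
toy_docs = [[0, 1, 2, 0, 1], [0, 2, 1, 1], [3, 4, 5, 3], [4, 5, 5, 3, 4]]
doc_topic, topic_word = lda_gibbs(toy_docs, num_topics=2, vocab_size=6)
```

Even this simplest case requires deriving the conditional distribution and keeping three count tables consistent; a model with extra variables needs a different derivation and different bookkeeping, which is the kind of effort the abstract refers to.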

