A network approach to topic models

08/04/2017
by Martin Gerlach, et al.

One of the main computational and scientific challenges of the modern age is to extract useful information from unstructured texts. Topic models are a popular machine-learning approach that infers the latent topical structure of a collection of documents. Topic models, and in particular their most widely used variant, Latent Dirichlet Allocation (LDA), have been successful and have found numerous applications in sociology, history, and linguistics. Nevertheless, they are known to suffer from severe conceptual and practical problems, e.g. a lack of justification for the Bayesian priors, discrepancies with statistical properties of real texts, and the inability to properly choose the number of topics. Here, we approach the problem of identifying topical structures by representing text corpora as bipartite networks of documents and words and by using methods from community detection in complex networks, in particular stochastic block models (SBM). We show that our SBM-based approach constitutes a more principled and versatile framework for topic modeling, overcoming the intrinsic limitations of Dirichlet-based models through a more general choice of nonparametric priors. It automatically detects the number of topics and hierarchically clusters both the words and the documents. In practice, we demonstrate through the analysis of artificial and real corpora that our approach outperforms LDA in terms of statistical model selection.
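
To make the bipartite representation mentioned in the abstract concrete, here is a minimal sketch of how a corpus could be turned into a document-word network. It is an illustration under stated assumptions, not the authors' implementation: the function name `build_bipartite_edges`, the toy corpus, and the whitespace tokenization are all hypothetical. The sketch only builds the network; fitting a stochastic block model to it is not shown.

```python
from collections import Counter


def build_bipartite_edges(corpus):
    """Map (doc_id, word) -> number of times the word occurs in the document.

    Each key is an edge between a document node and a word node; the count
    is the edge multiplicity. Tokenization is naive lowercase whitespace
    splitting (an assumption made for this sketch).
    """
    edges = Counter()
    for doc_id, text in corpus.items():
        for token in text.lower().split():
            edges[(doc_id, token)] += 1
    return edges


# Hypothetical toy corpus, used only to illustrate the data structure.
toy_corpus = {
    "doc_a": "network community network",
    "doc_b": "topic model topic network",
}

edges = build_bipartite_edges(toy_corpus)

# Node degrees on each side of the bipartite network:
# a document's degree is its length in tokens,
# a word's degree is its total frequency in the corpus.
doc_degree = Counter()
word_degree = Counter()
for (doc_id, word), count in edges.items():
    doc_degree[doc_id] += count
    word_degree[word] += count
```

In this representation a community-detection method would group document nodes and word nodes separately; the word groups then play the role of topics.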

research
05/17/2021

TopicsRanksDC: Distance-based Topic Ranking applied on Two-Class Data

In this paper, we introduce a novel approach named TopicsRanksDC for top...
research
01/06/2020

Topic Extraction of Crawled Documents Collection using Correlated Topic Model in MapReduce Framework

The tremendous increase in the amount of available research documents im...
research
02/03/2014

A high-reproducibility and high-accuracy method for automated topic classification

Much of human knowledge sits in large databases of unstructured text. Le...
research
07/11/2017

Look Who's Talking: Bipartite Networks as Representations of a Topic Model of New Zealand Parliamentary Speeches

Quantitative methods to measure the participation to parliamentary debat...
research
01/24/2013

Transfer Topic Modeling with Ease and Scalability

The increasing volume of short texts generated on social media sites, su...
research
12/28/2022

Choosing the Number of Topics in LDA Models – A Monte Carlo Comparison of Selection Criteria

Selecting the number of topics in LDA models is considered to be a diffi...
research
07/29/2017

Topology Analysis of International Networks Based on Debates in the United Nations

In complex, high dimensional and unstructured data it is often difficult...
