A high-reproducibility and high-accuracy method for automated topic classification

02/03/2014
by Andrea Lancichinetti, et al.

Much of human knowledge sits in large databases of unstructured text. Leveraging this knowledge requires algorithms that extract and record metadata on unstructured text documents. Assigning topics to documents will enable intelligent search, statistical characterization, and meaningful classification. Latent Dirichlet allocation (LDA) is the state of the art in topic classification. Here, we perform a systematic theoretical and numerical analysis demonstrating that current optimization techniques for LDA often fail to infer the most suitable model parameters accurately. Adapting approaches for community detection in networks, we propose a new algorithm that displays high reproducibility and high accuracy and is also computationally efficient. We apply it to a large set of documents in the English Wikipedia and reveal its hierarchical structure. Our algorithm promises to make "big data" text analysis systems more reliable.
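To make the notion of reproducibility concrete, the sketch below compares the topic assignments produced by two independent runs of a topic model. It is an illustration under stated assumptions, not the measure used in the paper: it assumes each run assigns every document a single dominant topic, and it scores agreement with normalized mutual information (NMI), which ignores how topics happen to be numbered. The function name and the example labelings are hypothetical.

```python
from collections import Counter
from math import log, sqrt


def normalized_mutual_information(labels_a, labels_b):
    """NMI between two labelings of the same documents (1.0 = identical up to renaming)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("labelings must be non-empty and of equal length")
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))

    # Mutual information from the joint and marginal label frequencies.
    mutual_info = sum(
        (n_ab / n) * log(n_ab * n / (count_a[a] * count_b[b]))
        for (a, b), n_ab in count_ab.items()
    )
    entropy_a = -sum((c / n) * log(c / n) for c in count_a.values())
    entropy_b = -sum((c / n) * log(c / n) for c in count_b.values())
    if entropy_a == 0.0 or entropy_b == 0.0:
        # Degenerate case: at least one run put every document in one topic.
        return 1.0 if entropy_a == entropy_b else 0.0
    return mutual_info / sqrt(entropy_a * entropy_b)


# Hypothetical dominant-topic assignments for six documents from three runs.
run_1 = [0, 0, 1, 1, 2, 2]
run_2 = [2, 2, 0, 0, 1, 1]  # same partition as run_1, topics renamed
run_3 = [0, 1, 0, 1, 0, 1]  # a different partition

same = normalized_mutual_information(run_1, run_2)
different = normalized_mutual_information(run_1, run_3)
```

Here `same` equals 1 up to floating-point rounding, because the two runs group the documents identically, while `different` equals 0, because the third run's grouping carries no information about the first. Averaging such a score over many pairs of runs with different random initializations gives one rough indicator of how reproducible an inference procedure is.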


