Topic Stability over Noisy Sources

08/05/2015
by Jing Su, et al.

Topic modelling techniques such as LDA have recently been applied to speech transcripts and OCR output. These corpora may contain noisy or erroneous text, which can undermine topic stability. It is therefore important to know how well a topic modelling algorithm performs when applied to noisy data. In this paper we show that different types of textual noise have different effects on the stability of different topic models. From these observations, we propose guidelines for text corpus generation, with a focus on automatic speech transcription. We also suggest topic model selection methods for noisy corpora.

research
04/13/2018

Per-Corpus Configuration of Topic Modelling for GitHub and Stack Overflow Collections

To make sense of large amounts of textual data, topic modelling is frequ...
research
02/13/2023

Visualizing Topic Uncertainty in Topic Modelling

Word clouds became a standard tool for presenting results of natural lan...
research
04/15/2019

A framework for streamlined statistical prediction using topic models

In the Humanities and Social Sciences, there is increasing interest in a...
research
11/21/2021

Jointly Dynamic Topic Model for Recognition of Lead-lag Relationship in Two Text Corpora

Topic evolution modeling has received significant attentions in recent d...
research
02/28/2018

Application of Rényi and Tsallis Entropies to Topic Modeling Optimization

This is full length article (draft version) where problem number of topi...
research
05/04/2020

Modelling Grocery Retail Topic Distributions: Evaluation, Interpretability and Stability

Understanding the shopping motivations behind market baskets has high co...
research
12/21/2016

Inverted Bilingual Topic Models for Lexicon Extraction from Non-parallel Data

Topic models have been successfully applied in lexicon extraction. Howev...
