Per-Corpus Configuration of Topic Modelling for GitHub and Stack Overflow Collections

04/13/2018
by Christoph Treude, et al.

To make sense of large amounts of textual data, topic modelling is frequently used as a text-mining tool for the discovery of hidden semantic structures in text bodies. Latent Dirichlet allocation (LDA) is a commonly used topic model that aims to explain the structure of a corpus by grouping texts. LDA requires multiple parameters to work well, and only rough and sometimes conflicting guidelines are available on how these parameters should be set. In this paper, we contribute (i) a broad study of parameters to arrive at good local optima, (ii) an a posteriori characterisation of text corpora related to eight programming languages from GitHub and Stack Overflow, and (iii) an analysis of corpus feature importance via per-corpus LDA configuration.
