BATS: A Spectral Biclustering Approach to Single Document Topic Modeling and Segmentation

08/05/2020
by Sirui Wang, et al.

Existing topic modeling and text segmentation methodologies generally require large datasets for training, which limits their usefulness when only small collections of text are available. In this work, we reexamine the interrelated problems of "topic identification" and "text segmentation" in the sparse document learning setting, where only a single new text is of interest. In developing a methodology to handle single documents, we face two major challenges. The first is sparse information: with access to only one document, we cannot train traditional topic models or deep learning algorithms. The second is significant noise: a considerable portion of the words in any single document contribute only noise and do not help discern topics or segments. To tackle these issues, we design an unsupervised, computationally efficient methodology called BATS: Biclustering Approach to Topic modeling and Segmentation. BATS leverages three key ideas to simultaneously identify topics and segment text: (i) a new mechanism that uses word order information to reduce sample complexity, (ii) a statistically sound graph-based biclustering technique that identifies latent structures of words and sentences, and (iii) a collection of effective heuristics that remove noise words and reward important words to further improve performance. Experiments on four datasets show that our approach outperforms several state-of-the-art baselines on topic coherence, topic diversity, segmentation, and runtime metrics.
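To make idea (ii) concrete, here is a minimal sketch of two-way spectral biclustering on a sentence-by-word count matrix. It is not the BATS algorithm itself: it uses the classic bipartite spectral partitioning recipe (degree-normalize the matrix, then split rows and columns by the sign of the second singular vectors) and omits the word-order mechanism and the noise-word heuristics. The function names `build_sentence_word_matrix` and `spectral_bipartition`, and the toy sentences, are hypothetical and chosen only for illustration.

```python
import numpy as np

def build_sentence_word_matrix(sentences):
    """Return a (sentences x vocabulary) count matrix and the vocabulary list."""
    vocab = sorted({w for s in sentences for w in s.lower().split()})
    index = {w: j for j, w in enumerate(vocab)}
    A = np.zeros((len(sentences), len(vocab)))
    for i, s in enumerate(sentences):
        for w in s.lower().split():
            A[i, index[w]] += 1.0
    return A, vocab

def spectral_bipartition(A):
    """Split rows and columns of A into two co-clusters.

    Assumes every row and column has at least one nonzero entry and that
    the sentence-word graph is connected (illustrative assumptions).
    """
    d_rows = A.sum(axis=1)  # sentence degrees
    d_cols = A.sum(axis=0)  # word degrees
    # Degree-normalized matrix D_rows^{-1/2} A D_cols^{-1/2}
    A_n = A / np.sqrt(d_rows)[:, None] / np.sqrt(d_cols)[None, :]
    U, _, Vt = np.linalg.svd(A_n, full_matrices=False)
    # The first singular pair is trivial; the second carries the partition.
    z_rows = U[:, 1] / np.sqrt(d_rows)
    z_cols = Vt[1, :] / np.sqrt(d_cols)
    return (z_rows > 0).astype(int), (z_cols > 0).astype(int)

# Toy single "document": two sentences per topic, with "to" shared across topics.
sentences = [
    "bats use echolocation to hunt insects",
    "bats hunt insects at night",
    "a topic model assigns words to topics",
    "a topic model learns topics from words",
]
A, vocab = build_sentence_word_matrix(sentences)
row_labels, col_labels = spectral_bipartition(A)

for label, sentence in zip(row_labels, sentences):
    print(label, sentence)
```

The row labels give a segmentation of the sentences and the column labels group the words, which is the sense in which biclustering identifies topics and segments together. For more than two topics, the usual extension is to run k-means on several singular vectors rather than thresholding one.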

research
10/12/2018

HiTR: Hierarchical Topic Model Re-estimation for Measuring Topical Diversity of Documents

A high degree of topical diversity is often considered to be an importan...
research
10/15/2018

Improving Topic Models with Latent Feature Word Representations

Probabilistic topic models are widely used to discover latent topics in ...
research
11/04/2016

Generalized Topic Modeling

Recently there has been significant activity in developing algorithms wi...
research
10/31/2021

Conical Classification For Computationally Efficient One-Class Topic Determination

As the Internet grows in size, so does the amount of text based informat...
research
11/25/2019

My Approach = Your Apparatus? Entropy-Based Topic Modeling on Multiple Domain-Specific Text Collections

Comparative text mining extends from genre analysis and political bias d...
research
03/15/2013

Topic Discovery through Data Dependent and Random Projections

We present algorithms for topic modeling based on the geometry of cross-...
research
10/25/2016

Scalable Dynamic Topic Modeling with Clustered Latent Dirichlet Allocation (CLDA)

Topic modeling, a method for extracting the underlying themes from a col...
