Efficient and Flexible Topic Modeling using Pretrained Embeddings and Bag of Sentences

02/06/2023
by Johannes Schneider, et al.

Pre-trained language models have led to a new state of the art in many NLP tasks. However, for topic modeling, statistical generative models such as LDA remain prevalent; they do not easily allow incorporating contextual word vectors and may yield topics that do not align well with human judgment. In this work, we propose a novel topic modeling and inference algorithm. We suggest a bag of sentences (BoS) approach that uses sentences as the unit of analysis. We leverage pre-trained sentence embeddings by combining generative process models with clustering, and we derive a fast inference algorithm based on expectation maximization, hard assignments, and an annealing process. Our evaluation shows that our method yields state-of-the-art results with relatively little computational demand. Our method is also more flexible than prior works leveraging word embeddings, since it allows customizing topic-document distributions via priors. Code is at <https://github.com/JohnTailor/BertSenClu>.
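As a rough illustration of the recipe the abstract describes, the sketch below embeds each sentence with a pre-trained sentence encoder and clusters the embeddings into topics via expectation maximization that anneals from soft to hard assignments. This is a minimal sketch under stated assumptions, not the BertSenClu implementation: the encoder name (all-MiniLM-L6-v2), the annealing schedule, and the function bos_topic_model are all illustrative choices.

```python
# Minimal sketch (not the authors' implementation): topic modeling over a
# bag of sentences via annealed soft-to-hard EM on sentence embeddings.
# Encoder name, schedule, and all identifiers are illustrative assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer


def bos_topic_model(sentences, n_topics=10, n_iters=50, seed=0):
    rng = np.random.default_rng(seed)
    # Encode each sentence: the "bag of sentences" unit of analysis.
    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder
    X = encoder.encode(sentences)                      # (n_sentences, dim)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # unit sphere / cosine

    # Initialize topic vectors from randomly chosen sentences.
    centers = X[rng.choice(len(X), size=n_topics, replace=False)]
    for it in range(n_iters):
        temp = max(1.0 - it / n_iters, 1e-3)  # anneal toward hard assignments
        sims = X @ centers.T                  # sentence-topic similarities
        if temp > 0.05:
            # Soft E-step early on: tempered-softmax responsibilities.
            logits = sims / temp
            logits -= logits.max(axis=1, keepdims=True)
            resp = np.exp(logits)
            resp /= resp.sum(axis=1, keepdims=True)
        else:
            # Hard E-step once annealing has cooled: one topic per sentence.
            resp = np.zeros_like(sims)
            resp[np.arange(len(X)), sims.argmax(axis=1)] = 1.0
        # M-step: topic vectors as re-normalized weighted sentence means.
        centers = resp.T @ X
        centers /= np.linalg.norm(centers, axis=1, keepdims=True) + 1e-12

    # Final hard assignments give each sentence's topic.
    return (X @ centers.T).argmax(axis=1), centers
```

In this spherical-k-means-style variant, the temperature controls how soft the E-step is, and cooling it reproduces the hard assignments mentioned in the abstract. The actual method additionally models a generative process with customizable topic-document priors, which this sketch omits.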

Related research

04/30/2020
Tired of Topic Models? Clusters of Pretrained Word Embeddings Make for Fast and Good Topics too!
Topic models are a useful analysis tool to uncover the underlying themes...

05/16/2023
BERTTM: Leveraging Contextualized Word Embeddings from Pre-trained Language Models for Neural Topic Modeling
With the development of neural topic models in recent years, topic model...

03/11/2019
ETNLP: A Toolkit for Extraction, Evaluation and Visualization of Pre-trained Word Embeddings
In this paper, we introduce a comprehensive toolkit, ETNLP, which can ev...

10/14/2021
Neural Attention-Aware Hierarchical Topic Model
Neural topic models (NTMs) apply deep neural networks to topic modelling...

05/20/2017
Mixed Membership Word Embeddings for Computational Social Science
Word embeddings improve the performance of NLP systems by revealing the ...

05/01/2017
Learning Topic-Sensitive Word Representations
Distributed word representations are widely used for modeling words in N...

02/24/2023
NoPPA: Non-Parametric Pairwise Attention Random Walk Model for Sentence Representation
We propose a novel non-parametric/un-trainable language model, named Non...
