Top2Vec: Distributed Representations of Topics

08/19/2020
by   Dimo Angelov, et al.

Topic modeling is used to discover latent semantic structure, usually referred to as topics, in a large collection of documents. The most widely used methods are Latent Dirichlet Allocation and Probabilistic Latent Semantic Analysis. Despite their popularity, they have several weaknesses. To achieve optimal results, they often require the number of topics to be known in advance, along with custom stop-word lists, stemming, and lemmatization. Additionally, these methods rely on a bag-of-words representation of documents, which ignores the ordering and semantics of words. Distributed representations of documents and words have gained popularity due to their ability to capture the semantics of words and documents. We present top2vec, which leverages joint document and word semantic embedding to find topic vectors. This model does not require stop-word lists, stemming, or lemmatization, and it automatically finds the number of topics. The resulting topic vectors are jointly embedded with the document and word vectors, with the distance between them representing semantic similarity. Our experiments demonstrate that top2vec finds topics that are significantly more informative and representative of the training corpus than those found by probabilistic generative models.
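The idea of topic vectors living in the same space as document and word vectors can be illustrated with a small sketch. It is not the authors' implementation. It assumes that documents have already been embedded and grouped into clusters, and that a topic vector is the centroid of its cluster's document vectors, a detail the abstract does not spell out. All vectors, names, and cluster assignments below are hypothetical toy data.

```python
import math


def centroid(vectors):
    """Average a list of equal-length vectors component-wise."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_words(topic, word_vectors, k=2):
    """Return the k words whose vectors are most similar to the topic vector."""
    ranked = sorted(
        word_vectors,
        key=lambda word: cosine(topic, word_vectors[word]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical 2-D joint embedding: documents and words share one space.
doc_vectors = {
    "doc_a": [0.9, 0.1],
    "doc_b": [0.8, 0.2],
    "doc_c": [0.1, 0.9],
    "doc_d": [0.2, 0.8],
}
word_vectors = {
    "goal": [1.0, 0.0],
    "match": [0.9, 0.3],
    "election": [0.0, 1.0],
    "vote": [0.3, 0.9],
}

# Assumption: a clustering step has already grouped the documents.
clusters = {0: ["doc_a", "doc_b"], 1: ["doc_c", "doc_d"]}

# One topic vector per cluster, placed in the same space as docs and words.
topic_vectors = {
    label: centroid([doc_vectors[name] for name in names])
    for label, names in clusters.items()
}

# Describe each topic by the words closest to its topic vector.
topic_words = {
    label: nearest_words(vector, word_vectors)
    for label, vector in topic_vectors.items()
}

print(topic_words)
```

Because the topic vectors sit in the same space as the word vectors, each topic can be described by its nearest words without a separate word-probability model, and the number of topics follows from the number of clusters rather than being fixed beforehand.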

02/25/2020

Gaussian Hierarchical Latent Dirichlet Allocation: Bringing Polysemy Back

Topic models are widely used to discover the latent representation of a ...
11/25/2019

Discovering topics with neural topic models built from PLSA assumptions

In this paper we present a model for unsupervised topic discovery in tex...
11/24/2017

Continuous Semantic Topic Embedding Model Using Variational Autoencoder

This paper proposes the continuous semantic topic embedding model (CSTEM...
02/12/2015

Ordering-sensitive and Semantic-aware Topic Modeling

Topic modeling of textual corpora is an important and challenging proble...
10/23/2018

Topic representation: finding more representative words in topic models

The top word list, i.e., the top-M words with highest marginal probabili...
02/14/2012

Multidimensional counting grids: Inferring word order from disordered bags of words

Models of bags of words typically assume topic mixing so that the words ...
05/31/2022

LEXpander: applying colexification networks to automated lexicon expansion

Recent approaches to text analysis from social media and other corpora r...

Code Repositories

Top2Vec

Top2Vec learns jointly embedded topic, document and word vectors.



doc2vec

Distributed Representations of Sentences and Documents



Topic_Modelling_Top2Vec_BERTopic




TOP_BRAIN_NLP_MODEL

A topic NLP model that can classify and discover topics in documents.



doc2vec

This is a read-only mirror of the CRAN R package repository. doc2vec: Distributed Representations of Sentences, Documents and Topics. Homepage: https://github.com/bnosac/doc2vec

