textTOvec: Deep Contextualized Neural Autoregressive Topic Models of Language with Distributed Compositional Prior

10/09/2018
by Pankaj Gupta, et al.

We address two challenges of probabilistic topic modelling in order to better estimate the probability of a word in a given context, i.e., P(word|context):

(1) No Language Structure in Context: Probabilistic topic models ignore word order by summarizing a given context as a "bag-of-words", so the semantics of the words in that context is lost. We incorporate language structure by combining a neural autoregressive topic model (TM), e.g., DocNADE, with an LSTM-based language model (LSTM-LM) in a single probabilistic framework, named ctx-DocNADE. The LSTM-LM learns a vector-space representation of each word by accounting for word order in local collocation patterns and models complex characteristics of language (e.g., syntax and semantics), while the TM simultaneously learns a latent representation of the entire document and discovers its underlying thematic structure. The two components thus unite complementary paradigms of learning the meaning of word occurrences.

(2) Limited Context and/or Smaller Training Corpus: In settings with few word occurrences (i.e., lack of context) in short texts, or with data sparsity in a corpus of few documents, applying TMs is challenging. We address this by incorporating external knowledge into neural autoregressive topic models via a language-modelling approach: we feed pre-trained word embeddings into the LSTM-LM to improve the word-topic mapping on smaller and/or short-text corpora. This DocNADE extension is named ctx-DocNADEe.

The proposed neural autoregressive topic model variants, coupled with neural LMs and embedding priors, consistently outperform state-of-the-art generative TMs in terms of generalization (perplexity), interpretability (topic coherence) and applicability (retrieval and classification) over 6 long-text and 8 short-text datasets from diverse domains.
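To make the composition concrete, the sketch below shows the kind of mixed hidden state such a model computes: a DocNADE-style bag-of-words prefix sum and an LSTM state over the same word prefix are combined before predicting the next word. This is an illustrative PyTorch reconstruction based only on the description above, not the authors' released code; the class name CtxDocNADE, the sigmoid activation, the shared hidden size, and the mixing weight `lam` are all assumptions.

```python
import torch
import torch.nn as nn

class CtxDocNADE(nn.Module):
    """Minimal sketch of a DocNADE-style autoregressive topic model whose
    hidden state is mixed with an LSTM-LM state (names are illustrative)."""

    def __init__(self, vocab_size, hidden_dim, emb=None, lam=0.5):
        super().__init__()
        self.W = nn.Parameter(0.01 * torch.randn(hidden_dim, vocab_size))  # topic-side word vectors
        self.c = nn.Parameter(torch.zeros(hidden_dim))                     # hidden bias
        self.V = nn.Linear(hidden_dim, vocab_size)                         # softmax output layer
        # LM side; for a ctx-DocNADEe-like variant, `emb` would be built from
        # pre-trained word embeddings (e.g., GloVe) and optionally frozen.
        self.emb = emb if emb is not None else nn.Embedding(vocab_size, hidden_dim)
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.lam = lam  # assumed scaling factor for the LM contribution

    def forward(self, doc):
        # doc: LongTensor (seq_len,) of word indices for one document.
        # DocNADE part: bag-of-words context sum_{k<i} W[:, v_k], excluding v_i.
        w_cols = self.W[:, doc].t()                        # (seq_len, hidden)
        bow = torch.cumsum(w_cols, dim=0)
        bow = torch.cat([w_cols.new_zeros(1, w_cols.size(1)), bow[:-1]], dim=0)
        # LSTM-LM part: order-aware state over the same prefix, shifted right.
        lm_out, _ = self.lstm(self.emb(doc).unsqueeze(0))  # (1, seq_len, hidden)
        lm = torch.cat([lm_out.new_zeros(1, 1, lm_out.size(-1)),
                        lm_out[:, :-1]], dim=1).squeeze(0)
        # Mix the two contexts, then score every vocabulary word.
        h = torch.sigmoid(self.c + bow + self.lam * lm)
        return self.V(h)                                   # logits for p(v_i | v_{<i})
```

Training would minimize the autoregressive cross-entropy, sum_i -log p(v_i | v_{<i}), over each document; initializing `emb` from pre-trained vectors is what would carry the "distributed compositional prior" of the ctx-DocNADEe variant into the word-topic mapping.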

