textTOvec: Deep Contextualized Neural Autoregressive Topic Models of Language with Distributed Compositional Prior

10/09/2018
by Pankaj Gupta, et al.

We address two challenges of probabilistic topic modelling in order to better estimate the probability of a word in a given context, i.e., P(word|context):

(1) No language structure in context: Probabilistic topic models ignore word order by summarizing a given context as a "bag-of-words", and consequently the semantics of the words in that context are lost. In this work, we incorporate language structure by combining a neural autoregressive topic model (TM), such as DocNADE, with an LSTM-based language model (LSTM-LM) in a single probabilistic framework. The LSTM-LM learns a vector-space representation of each word by accounting for word order in local collocation patterns and models complex characteristics of language (e.g., syntax and semantics), while the TM simultaneously learns a latent representation of the entire document and discovers its underlying thematic structure. Uniting these two complementary paradigms of learning the meaning of word occurrences yields the model we name ctx-DocNADE.

(2) Limited context and/or a small training corpus: Applying TMs is challenging when word occurrences are scarce (i.e., there is little context), as in short texts, or when the corpus itself contains few documents. We address this by incorporating external knowledge into neural autoregressive topic models via a language-modelling approach: we feed pretrained word embeddings into the LSTM-LM, with the aim of improving the word-topic mapping on a smaller and/or short-text corpus. This DocNADE extension is named ctx-DocNADEe.

The proposed neural autoregressive topic model variants, coupled with neural LMs and embedding priors, consistently outperform state-of-the-art generative TMs in generalization (perplexity), interpretability (topic coherence) and applicability (retrieval and classification) across 6 long-text and 8 short-text datasets from diverse domains.
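To make the coupling concrete, the sketch below shows one way a ctx-DocNADE-style model could compose its two views of the context: a DocNADE hidden state built from the bag of preceding words, and an LSTM-LM hidden state built from the same words in order, mixed by a scalar weight before a softmax over the vocabulary. This is a minimal illustration in PyTorch, not the authors' implementation; the class name CtxDocNADE, the mixing weight lam, and the pretrained_emb argument (standing in for the ctx-DocNADEe embedding prior, e.g., GloVe vectors) are assumptions for the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CtxDocNADE(nn.Module):
    """Sketch: DocNADE hidden state mixed with an LSTM-LM hidden state."""
    def __init__(self, vocab_size, n_topics=50, emb_dim=100, lam=0.5,
                 pretrained_emb=None):
        super().__init__()
        self.W = nn.Embedding(vocab_size, n_topics)   # TM word-topic matrix
        self.c = nn.Parameter(torch.zeros(n_topics))  # TM hidden bias
        self.emb = nn.Embedding(vocab_size, emb_dim)  # LM input embeddings
        if pretrained_emb is not None:                # ctx-DocNADEe: embedding prior
            self.emb.weight.data.copy_(pretrained_emb)
        self.lstm = nn.LSTM(emb_dim, n_topics, batch_first=True)
        self.out = nn.Linear(n_topics, vocab_size)    # per-position softmax logits
        self.lam = lam                                # mixing weight for the LM signal

    def forward(self, doc):                           # doc: (1, T) word ids
        w = self.W(doc)                               # (1, T, n_topics)
        # TM view: position i conditions on the *bag* of words at positions < i,
        # via exclusive prefix sums of the word-topic vectors.
        h_tm = torch.sigmoid(self.c + torch.cumsum(w, dim=1) - w)
        # LM view: position i conditions on the *ordered* words at positions < i,
        # so the embeddings are shifted right by one step before the LSTM.
        e = self.emb(doc)
        e = torch.cat([torch.zeros_like(e[:, :1]), e[:, :-1]], dim=1)
        h_lm, _ = self.lstm(e)                        # (1, T, n_topics)
        h = h_tm + self.lam * h_lm                    # compose the two views
        return self.out(h)

    def nll(self, doc):
        # Autoregressive negative log-likelihood of the document.
        logits = self.forward(doc)
        return F.cross_entropy(logits.view(-1, logits.size(-1)),
                               doc.view(-1), reduction="sum")
```

Setting lam to 0 recovers a plain DocNADE-style predictor, so the LM term acts as a contextual correction to the topic model. Training would minimize nll per document; the perplexity metric reported in the abstract is then the exponential of the average per-word negative log-likelihood on held-out documents.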


Related research

10/09/2018

textTOvec: Deep Contextualized Neural Autoregressive Models of Language with Distributed Compositional Prior

We address two challenges of probabilistic topic modelling in order to b...
09/15/2018

Document Informed Neural Autoregressive Topic Models with Distributional Prior

We address two challenges in topic models: (1) Context information aroun...
08/11/2018

Document Informed Neural Autoregressive Topic Models

Context information around words helps in determining their actual meani...
11/22/2019

Topical Phrase Extraction from Clinical Reports by Incorporating both Local and Global Context

Making sense of words often requires to simultaneously examine the surro...
09/29/2019

Lifelong Neural Topic Learning in Contextualized Autoregressive Topic Models of Language via Informative Transfers

Topic models such as LDA, DocNADE, iDocNADEe have been popular in docume...
09/19/2017

MetaLDA: a Topic Model that Efficiently Incorporates Meta information

Besides the text content, documents and their associated words usually c...
07/07/2016

Predicting and Understanding Law-Making with Word Vectors and an Ensemble Model

Out of nearly 70,000 bills introduced in the U.S. Congress from 2001 to ...