textTOvec: Deep Contextualized Neural Autoregressive Topic Models of Language with Distributed Compositional Prior

10/09/2018
by Pankaj Gupta, et al.

We address two challenges of probabilistic topic modelling in order to better estimate the probability of a word in a given context, i.e., P(word|context): (1) No Language Structure in Context: Probabilistic topic models ignore word order by summarizing a given context as a "bag-of-words", and consequently the semantics of the words in the context is lost. We therefore combine a neural autoregressive topic model (TM) with an LSTM-based language model (LSTM-LM) in a single probabilistic framework. The LSTM-LM learns a vector-space representation of each word by accounting for word order in local collocation patterns and models complex characteristics of language (e.g., syntax and semantics), while the TM simultaneously learns a latent representation from the entire document and discovers its underlying thematic structure. We thus unite two complementary paradigms of learning the meaning of word occurrences in a unified probabilistic framework, named ctx-DocNADE. (2) Limited Context and/or Smaller Training Corpus of Documents: In settings with few word occurrences (i.e., lack of context) in short texts, or with data sparsity in a corpus of few documents, the application of TMs is challenging. We address this challenge by incorporating external knowledge into neural autoregressive topic models via a language modelling approach: we use pre-trained word embeddings as input to the LSTM-LM with the aim of improving the word-topic mapping on a smaller and/or short-text corpus. The proposed DocNADE extension is named ctx-DocNADEe. We present novel neural autoregressive topic model variants coupled with neural LMs and embedding priors that consistently outperform state-of-the-art generative TMs in terms of generalization (perplexity), interpretability (topic coherence) and applicability (retrieval and classification) over 6 long-text and 8 short-text datasets from diverse domains.
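To make the combination of the two views concrete, the sketch below (not the authors' code) mixes a DocNADE-style bag-of-words hidden state with an LSTM-LM hidden state when scoring P(word_i | words_<i). The class name, dimensions (`topic_dim`, `lstm_dim`) and the mixing weight `lam` are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of a ctx-DocNADE-style scorer, assuming PyTorch and toy dimensions.
import torch
import torch.nn as nn


class CtxDocNADESketch(nn.Module):
    def __init__(self, vocab_size, topic_dim=50, lstm_dim=50, lam=0.5):
        super().__init__()
        self.lam = lam                                  # mixing weight between TM and LM views (assumed)
        self.emb = nn.Embedding(vocab_size, lstm_dim)   # word embeddings (pre-trained in ctx-DocNADEe)
        self.W = nn.Embedding(vocab_size, topic_dim)    # DocNADE-style encoding matrix
        self.lstm = nn.LSTM(lstm_dim, topic_dim, batch_first=True)
        self.out = nn.Linear(topic_dim, vocab_size)     # shared softmax layer over the vocabulary

    def forward(self, doc):
        # doc: LongTensor of shape (1, T) holding word indices in document order.
        T = doc.size(1)
        lstm_out, _ = self.lstm(self.emb(doc))          # ordered (syntactic) view of the document
        logits = []
        for i in range(T):
            # DocNADE view: sum of encodings of the preceding words (bag-of-words).
            if i == 0:
                h_tm = torch.zeros(1, self.W.embedding_dim)
                h_lm = torch.zeros(1, self.W.embedding_dim)
            else:
                h_tm = self.W(doc[:, :i]).sum(dim=1)
                h_lm = lstm_out[:, i - 1, :]
            h = torch.sigmoid(h_tm + self.lam * h_lm)   # combine thematic and contextual signals
            logits.append(self.out(h))                  # scores for P(word_i | words_<i)
        return torch.stack(logits, dim=1)               # shape (1, T, vocab_size)


# Usage: score a toy document of 5 word ids over a vocabulary of 100 words.
model = CtxDocNADESketch(vocab_size=100)
doc = torch.randint(0, 100, (1, 5))
log_probs = torch.log_softmax(model(doc), dim=-1)
print(log_probs.shape)  # torch.Size([1, 5, 100])
```

The additive mixing of the two hidden states is one simple way to realize the "TM + LM in a single framework" idea described above; the paper's exact parameterization may differ.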


