textTOvec: Deep Contextualized Neural Autoregressive Topic Models of Language with Distributed Compositional Prior

10/09/2018
by   Pankaj Gupta, et al.

We address two challenges of probabilistic topic modelling in order to better estimate the probability of a word in a given context, i.e., P(word|context): (1) No Language Structure in Context: probabilistic topic models ignore word order by summarizing a given context as a "bag-of-words", and consequently the semantics of the words in the context is lost. We incorporate language structure by combining a neural autoregressive topic model (TM) with an LSTM-based language model (LSTM-LM) in a single probabilistic framework. The LSTM-LM learns a vector-space representation of each word by accounting for word order in local collocation patterns and models complex characteristics of language (e.g., syntax and semantics), while the TM simultaneously learns a latent representation of the entire document and discovers its underlying thematic structure. We unite these two complementary paradigms of learning the meaning of word occurrences in a unified probabilistic framework, named ctx-DocNADE. (2) Limited Context and/or Small Training Corpus: in settings with few word occurrences (i.e., a lack of context) in short texts, or with data sparsity in a corpus of few documents, applying TMs is challenging. We address this challenge by incorporating external knowledge into neural autoregressive topic models via a language-modelling approach: we use word embeddings as input to an LSTM-LM, with the aim of improving the word-topic mapping on a smaller and/or short-text corpus. The proposed DocNADE extension is named ctx-DocNADEe. We present novel neural autoregressive topic model variants, coupled with neural LMs and embedding priors, that consistently outperform state-of-the-art generative TMs in terms of generalization (perplexity), interpretability (topic coherence) and applicability (retrieval and classification) over 6 long-text and 8 short-text datasets from diverse domains.
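As a rough illustration of the idea, the sketch below (a minimal PyTorch sketch, not the authors' implementation: the module name, dimensions, the post-nonlinearity mixing and the weight `lam` are all assumptions) computes each autoregressive conditional P(word|context) from a DocNADE-style bag-of-words hidden state over the preceding words combined with the hidden state of an LSTM-LM run over the same prefix; in ctx-DocNADEe the LSTM input embeddings would additionally be initialized from pre-trained word vectors.

```python
# Hypothetical sketch of the ctx-DocNADE idea (not the authors' code):
# p(v_i | v_<i) is computed from a mixture of a DocNADE-style bag-of-words
# hidden state and an LSTM hidden state over the same word prefix.
import torch
import torch.nn as nn

class CtxDocNADESketch(nn.Module):
    def __init__(self, vocab_size, hidden_size, embed_size, lam=0.5):
        super().__init__()
        self.W = nn.Embedding(vocab_size, hidden_size)     # DocNADE word-to-hidden matrix
        self.bias_h = nn.Parameter(torch.zeros(hidden_size))
        self.embed = nn.Embedding(vocab_size, embed_size)   # LM input embeddings (could be pre-trained)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)        # softmax layer over the vocabulary
        self.lam = lam                                       # assumed mixing weight between TM and LM views

    def forward(self, doc):                                  # doc: (1, T) LongTensor of word indices
        T = doc.size(1)
        lm_h, _ = self.lstm(self.embed(doc))                 # (1, T, H): LM state after each word
        log_probs = []
        for i in range(T):
            # DocNADE hidden state: sum of W rows for the words preceding position i (bag of words)
            if i == 0:
                tm_h = torch.zeros(1, self.bias_h.size(0))
            else:
                tm_h = self.W(doc[:, :i]).sum(dim=1)
            tm_h = torch.sigmoid(tm_h + self.bias_h)
            # LM hidden state for the same prefix (state after word i-1; zeros before the first word)
            prefix_h = lm_h[:, i - 1, :] if i > 0 else torch.zeros_like(tm_h)
            h = self.lam * tm_h + (1.0 - self.lam) * prefix_h
            log_probs.append(torch.log_softmax(self.out(h), dim=-1)[0, doc[0, i]])
        return torch.stack(log_probs).sum()                  # log p(doc) under the autoregressive factorization

# Toy usage: perplexity of one short document of 12 word indices
model = CtxDocNADESketch(vocab_size=2000, hidden_size=64, embed_size=50)
doc = torch.randint(0, 2000, (1, 12))
nll = -model(doc)
print(torch.exp(nll / 12))
```

In this sketch, raising `lam` emphasizes the document-level topical view and lowering it favours the local language-model view; the paper combines the two signals more tightly inside the model, but the factorization of log p(doc) into per-word conditionals is the same.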



Related Research

09/15/2018 · Document Informed Neural Autoregressive Topic Models with Distributional Prior
09/14/2019 · Multi-view and Multi-source Transfers in Neural Topic Modeling
04/17/2021 · Multi-source Neural Topic Modeling in Multi-view Embedding Spaces
08/11/2018 · Document Informed Neural Autoregressive Topic Models
11/22/2019 · Topical Phrase Extraction from Clinical Reports by Incorporating both Local and Global Context
09/19/2017 · MetaLDA: a Topic Model that Efficiently Incorporates Meta information