Multidimensional counting grids: Inferring word order from disordered bags of words

02/14/2012
by   Nebojsa Jojic, et al.

Models of bags of words typically assume topic mixing, so that the words in a single bag come from a limited number of topics. We show here that many sets of bags of words exhibit a pattern of variation very different from the patterns that topic mixing captures efficiently. In many cases, from one bag of words to the next, words disappear and new ones appear as if the theme shifted slowly and smoothly across documents (provided that the documents are somehow ordered). Examples of latent structure that describes such ordering are easily imagined. For example, the advancing date of news stories is reflected in a smooth change in the theme of the day, as certain evolving news stories fall out of favor and new events create new stories. Overlaps among the stories of consecutive days can be modeled by using windows over linearly arranged tight distributions over words. We show here that this strategy can be extended to multiple dimensions and to cases where the ordering of the data is not readily obvious. We demonstrate that this way of modeling covariation in word occurrences outperforms standard topic models in classification and prediction tasks in biology, text modeling and computer vision.
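The idea of "windows over linearly arranged tight distributions over words" can be illustrated with a small one-dimensional sketch. It is not the authors' implementation: the toy grid, the uniform averaging inside a window, the wrap-around at the grid boundary, and the names `make_toy_grid`, `window_distribution` and `overlap` are all assumptions made for illustration.

```python
def make_toy_grid(n_positions, peak=0.9):
    """Hypothetical grid: position i puts `peak` mass on word i and
    spreads the remaining mass evenly over the other words."""
    rest = (1.0 - peak) / (n_positions - 1)
    return [[peak if w == i else rest for w in range(n_positions)]
            for i in range(n_positions)]


def window_distribution(grid, start, width):
    """Average the per-position word distributions inside one window.

    Uniform averaging and wrap-around indexing are assumptions of this sketch.
    """
    n_positions = len(grid)
    dist = [0.0] * len(grid[0])
    for offset in range(width):
        cell = grid[(start + offset) % n_positions]
        for w, p in enumerate(cell):
            dist[w] += p / width
    return dist


def overlap(p, q):
    """Shared probability mass between two word distributions."""
    return sum(min(a, b) for a, b in zip(p, q))


grid = make_toy_grid(8)
day0 = window_distribution(grid, 0, 3)  # window over positions 0..2
day1 = window_distribution(grid, 1, 3)  # shifted by one: shares positions 1..2
day4 = window_distribution(grid, 4, 3)  # shares no positions with day0

print(overlap(day0, day1), overlap(day0, day4))
```

In this toy setting, windows that start at neighboring positions share most of their word mass, while distant windows share little, which mirrors the slow thematic drift between consecutive documents described above.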


