More Than Words: Collocation Tokenization for Latent Dirichlet Allocation Models

08/24/2021
by   Jin Cheevaprawatdomrong, et al.

Traditionally, Latent Dirichlet Allocation (LDA) ingests words in a collection of documents to discover their latent topics using word-document co-occurrences. However, it is unclear how to achieve the best results for languages without marked word boundaries, such as Chinese and Thai. Here, we explore the use of Pearson's chi-squared test, t-statistics, and Word Pair Encoding (WPE) to produce tokens as input to the LDA model. The chi-squared, t, and WPE tokenizers are trained on Wikipedia text to look for words that should be grouped together, such as compound nouns, proper nouns, and complex event verbs. We propose a new metric for measuring the clustering quality in settings where the vocabularies of the models differ. Based on this metric and other established metrics, we show that models trained with merged tokens yield topic keys that are clearer, more coherent, and more effective at distinguishing topics than those of models trained on unmerged tokens.
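To make the idea of statistic-based merging concrete, the following is a minimal sketch, not the authors' implementation. It scores each adjacent word pair with Pearson's chi-squared statistic on a 2x2 contingency table and with the usual collocation t-statistic approximation, then greedily merges pairs whose t-score clears a threshold. The function names, the `min_t` threshold, the `_` separator, and the toy token list are all assumptions made for illustration; the abstract does not specify these details.

```python
from collections import Counter
from math import sqrt


def bigram_scores(tokens):
    """Return {(w1, w2): (chi2, t)} for every adjacent pair in tokens."""
    n = len(tokens) - 1                   # number of bigram positions
    first = Counter(tokens[:-1])          # counts of words in first position
    second = Counter(tokens[1:])          # counts of words in second position
    pairs = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (w1, w2), o11 in pairs.items():
        o12 = first[w1] - o11             # w1 followed by something else
        o21 = second[w2] - o11            # something else followed by w2
        o22 = n - o11 - o12 - o21         # neither w1 first nor w2 second
        denom = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
        chi2 = n * (o11 * o22 - o12 * o21) ** 2 / denom if denom else 0.0
        expected = first[w1] * second[w2] / n   # count under independence
        t = (o11 - expected) / sqrt(o11)        # variance approximated by o11
        scores[(w1, w2)] = (chi2, t)
    return scores


def merge_collocations(tokens, scores, min_t, sep="_"):
    """Greedily merge adjacent pairs whose t-score is at least min_t."""
    merged, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and scores.get(pair, (0.0, 0.0))[1] >= min_t:
            merged.append(sep.join(pair))
            i += 2                        # skip both halves of the merged pair
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Toy example (hypothetical data): "new york" recurs, the other pairs do not.
tokens = ["new", "york", "is", "big", "new", "york", "is", "old",
          "a", "b", "c", "d"]
scores = bigram_scores(tokens)
merged = merge_collocations(tokens, scores, min_t=1.0)
```

On this toy input, pairs that occur only once can still reach a high chi-squared value, which is why the sketch thresholds on the t-score instead; a real pipeline would estimate the counts from a large corpus such as Wikipedia and would likely add a minimum-frequency filter. The merged token list would then be passed to any LDA implementation in place of the original tokens.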


research
11/10/2014

Modeling Word Relatedness in Latent Dirichlet Allocation

Standard LDA model suffers the problem that the topic assignment of each...
research
04/23/2018

Discovering Style Trends through Deep Visually Aware Latent Item Embeddings

In this paper, we explore Latent Dirichlet Allocation (LDA) and Polyling...
research
06/23/2022

A Temporal Extension of Latent Dirichlet Allocation for Unsupervised Acoustic Unit Discovery

Latent Dirichlet allocation (LDA) is widely used for unsupervised topic ...
research
01/12/2017

Prior matters: simple and general methods for evaluating and improving topic quality in topic modeling

Latent Dirichlet Allocation (LDA) models trained without stopword remova...
research
06/28/2020

Mapping Topic Evolution Across Poetic Traditions

Poetic traditions across languages evolved differently, but we find that...
research
08/12/2018

Augmenting word2vec with latent Dirichlet allocation within a clinical application

This paper presents three hybrid models that directly combine latent Dir...
