Multilingual Topic Models for Unaligned Text

05/09/2012
by Jordan Boyd-Graber, et al.

We develop the multilingual topic model for unaligned text (MuTo), a probabilistic model of text designed to analyze corpora composed of documents in two languages. From these documents, MuTo uses stochastic EM to simultaneously discover both a matching between the languages and multilingual latent topics. We demonstrate that MuTo finds shared topics on real-world multilingual corpora and successfully pairs related documents across languages. MuTo provides a new framework for creating multilingual topic models without carefully curated parallel corpora, and it allows applications built on the topic model formalism to be applied to a much wider class of corpora.
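
To make the idea of "a matching between the languages" concrete, here is a minimal sketch of a one-to-one matching between two small word lists. It is an illustration only and not MuTo's inference procedure: the greedy strategy, the string-similarity score from Python's `difflib`, the `min_score` threshold, the function name `greedy_matching`, and the toy word lists are all assumptions made for this example.

```python
from difflib import SequenceMatcher


def greedy_matching(vocab_a, vocab_b, min_score=0.5):
    """Greedily build a one-to-one matching between two word lists.

    Hypothetical helper: pairs are scored by string similarity and
    accepted from best to worst, skipping words already matched.
    """
    # Score every cross-language pair of words.
    scored = [
        (SequenceMatcher(None, a, b).ratio(), a, b)
        for a in vocab_a
        for b in vocab_b
    ]
    # Best scores first; ties are broken alphabetically for determinism.
    scored.sort(key=lambda t: (-t[0], t[1], t[2]))

    used_a, used_b, matching = set(), set(), {}
    for score, a, b in scored:
        if score < min_score:
            break  # all remaining pairs are too dissimilar
        if a in used_a or b in used_b:
            continue  # each word may be matched at most once
        matching[a] = b
        used_a.add(a)
        used_b.add(b)
    return matching


# Toy vocabularies (assumed for illustration, not from the paper).
english = ["hotel", "music", "computer"]
german = ["musik", "hotel", "computer"]

print(greedy_matching(english, german))
```

A model like the one described above would learn the matching jointly with the topics rather than fixing it from surface similarity; the sketch only shows what a matching over two vocabularies looks like as a data structure.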
