Crosslingual Topic Modeling with WikiPDA

09/23/2020
by Tiziano Piccardi, et al.

We present Wikipedia-based Polyglot Dirichlet Allocation (WikiPDA), a crosslingual topic model that learns to represent Wikipedia articles written in any language as distributions over a common set of language-independent topics. It leverages the fact that Wikipedia articles link to each other and are mapped to concepts in the Wikidata knowledge base, such that, when represented as bags of links, articles are inherently language-independent. WikiPDA works in two steps, by first densifying bags of links using matrix completion and then training a standard monolingual topic model. A human evaluation shows that WikiPDA produces more coherent topics than monolingual text-based LDA, thus offering crosslinguality at no cost. We demonstrate WikiPDA's utility in two applications: a study of topical biases in 28 Wikipedia editions, and crosslingual supervised classification. Finally, we highlight WikiPDA's capacity for zero-shot language transfer, where a model is reused for new languages without any fine-tuning. Researchers can benefit from WikiPDA as a practical tool for studying Wikipedia's content across its 299 language editions in interpretable ways, via an easy-to-use library publicly available at https://github.com/epfl-dlab/WikiPDA.
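
To make the bag-of-links idea concrete, the following is a minimal sketch in Python of why link-based representations are language-independent. It covers only the representation step, not the matrix-completion densification or the topic model itself. The mapping table, the concept IDs, and the function name `bag_of_links` are hypothetical placeholders chosen for illustration; they are not taken from the WikiPDA library or from Wikidata.

```python
from collections import Counter

# Hypothetical toy mapping from (language, article title) to a concept ID.
# The IDs are made-up placeholders, not real Wikidata identifiers.
TITLE_TO_CONCEPT = {
    ("en", "Dog"): "C_dog",
    ("de", "Hund"): "C_dog",
    ("en", "Cat"): "C_cat",
    ("de", "Katze"): "C_cat",
    ("en", "Wolf"): "C_wolf",
    ("de", "Wolf"): "C_wolf",
}


def bag_of_links(lang, linked_titles):
    """Map an article's outgoing links to concept IDs and count them."""
    bag = Counter()
    for title in linked_titles:
        concept = TITLE_TO_CONCEPT.get((lang, title))
        if concept is not None:  # skip links that have no known concept
            bag[concept] += 1
    return bag


# Two articles on the same subject, written in different languages,
# each described only by the titles of the articles they link to.
english = bag_of_links("en", ["Dog", "Wolf", "Cat", "Dog"])
german = bag_of_links("de", ["Hund", "Wolf", "Katze", "Hund"])
```

In this toy setup both articles reduce to the same counts over concept IDs, so a single monolingual topic model could in principle be trained on bags from any language edition.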
