Cross-lingual Dataless Classification for Languages with Small Wikipedia Presence

11/13/2016
by   Yangqiu Song, et al.

This paper presents an approach to classifying documents in any language into an English topical label space without any text categorization training data. The approach, Cross-Lingual Dataless Document Classification (CLDDC), relies on mapping the English labels or short category descriptions into a Wikipedia-based semantic representation, and on the use of the target-language Wikipedia. Consequently, performance can suffer when the Wikipedia in the target language is small. In this paper, we focus on languages with small Wikipedias (small-Wikipedia languages, SWLs). We use a word-level dictionary to convert documents in an SWL to a large-Wikipedia language (LWL), and then perform CLDDC based on the LWL's Wikipedia. This approach can be applied to thousands of languages, in contrast to machine translation, which is a supervision-heavy approach and is available for only about 100 languages. We also develop a ranking algorithm that uses language similarity metrics to automatically select a good LWL, and show that this significantly improves classification of SWL documents, performing comparably to the best possible bridge.
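The pipeline described in the abstract (word-level dictionary conversion from an SWL to an LWL, followed by dataless classification in a Wikipedia-based concept space) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy dictionary, the invented SWL tokens, and the tiny concept vectors are all hypothetical, and the sketch assumes that LWL words and English labels have already been mapped into one shared set of concept IDs.

```python
import math
from collections import Counter

# Hypothetical toy resources; a real system would derive these from a
# bilingual dictionary and from Wikipedia-based concept representations.
SWL_TO_LWL = {"balo": ["ball"], "golo": ["goal"], "banko": ["bank"]}
LWL_CONCEPTS = {
    "ball": {"Sport": 1.0},
    "goal": {"Sport": 0.8, "Planning": 0.2},
    "bank": {"Finance": 1.0},
}
LABEL_VECTORS = {"sports": {"Sport": 1.0}, "business": {"Finance": 1.0}}


def translate_word_level(tokens, dictionary):
    """Replace each SWL token with its LWL translations; skip unknown tokens."""
    translated = []
    for token in tokens:
        translated.extend(dictionary.get(token, []))
    return translated


def embed(tokens, word_to_concepts):
    """Sum the sparse concept vectors of all tokens into one document vector."""
    vector = Counter()
    for token in tokens:
        for concept, weight in word_to_concepts.get(token, {}).items():
            vector[concept] += weight
    return vector


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as mappings."""
    dot = sum(weight * b.get(concept, 0.0) for concept, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(swl_tokens, dictionary, word_to_concepts, label_vectors):
    """Translate word by word, embed, and return the most similar label."""
    lwl_tokens = translate_word_level(swl_tokens, dictionary)
    doc_vector = embed(lwl_tokens, word_to_concepts)
    return max(label_vectors, key=lambda lab: cosine(doc_vector, label_vectors[lab]))


predicted = classify(["balo", "golo", "xyz"], SWL_TO_LWL, LWL_CONCEPTS, LABEL_VECTORS)
```

No labeled training documents are used at any point: the label is chosen purely by similarity between the document's concept vector and each label's concept vector, which is what makes the setup dataless. Tokens missing from the dictionary (here `xyz`) are simply dropped, which is one plausible way to handle dictionary gaps rather than a documented choice of the paper.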

research
02/19/2022

Models and Datasets for Cross-Lingual Summarisation

We present a cross-lingual summarisation corpus with long documents in a...
research
09/14/2016

Transliteration in Any Language with Surrogate Languages

We introduce a method for transliteration generation that can produce tr...
research
02/12/2018

Automatic Generation of Language-Independent Features for Cross-Lingual Classification

Many applications require categorization of text documents using predefi...
research
03/06/2013

Japanese-Spanish Thesaurus Construction Using English as a Pivot

We present the results of research with the goal of automatically creati...
research
05/05/2017

Cross-lingual Distillation for Text Classification

Cross-lingual text classification(CLTC) is the task of classifying docum...
research
05/04/2020

WikiUMLS: Aligning UMLS to Wikipedia via Cross-lingual Neural Ranking

We present our work on aligning the Unified Medical Language System (UML...
research
12/02/2020

A Computational Approach to Measuring the Semantic Divergence of Cognates

Meaning is the foundation stone of intercultural communication. Language...
