Automatic Generation of Language-Independent Features for Cross-Lingual Classification

02/12/2018
by Sarai Duek, et al.

Many applications require categorization of text documents using predefined categories. The main approach to performing text categorization is learning from labeled examples. For many tasks, it may be difficult to find examples in one language but easy in others. The problem of learning from examples in one or more languages and classifying (categorizing) documents in another is called cross-lingual learning. In this work, we present a novel approach that solves the general cross-lingual text categorization problem. Our method generates, for each training document, a set of language-independent features. Using these features for training yields a language-independent classifier. At the classification stage, we generate language-independent features for the unlabeled document and apply the classifier to the new representation. To build the feature generator, we utilize a hierarchical language-independent ontology, where each concept has a set of support documents for each language involved. In the preprocessing stage, we use the support documents to build a set of language-independent feature generators, one for each language. The collection of these generators is used to map any document into the language-independent feature space. Our methodology handles the most general cross-lingual text categorization setting: it can learn from any mix of languages and classify documents in any other language. We also present a method for exploiting the hierarchical structure of the ontology to create virtual supporting documents for languages that do not have them. We tested our method, using Wikipedia as our ontology, on the most commonly used test collections in cross-lingual text categorization, and found that it outperforms existing methods.
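
The pipeline described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy two-concept `ONTOLOGY`, the bag-of-words cosine similarity used as the per-language feature generator, and the nearest-centroid classifier are simple stand-ins, and the names `build_generator`, `train`, and `classify` are hypothetical. The sketch only shows the overall shape: one generator per language maps a document to a vector indexed by shared concepts, so a classifier trained on English documents can be applied to a Spanish one.

```python
import math
from collections import Counter

# Hypothetical toy ontology: concept -> language -> support documents.
# The paper uses Wikipedia; this tiny hand-made stand-in is an assumption.
ONTOLOGY = {
    "Sports": {
        "en": ["football match goal team", "team player score goal"],
        "es": ["partido de futbol gol equipo", "equipo jugador marcar gol"],
    },
    "Finance": {
        "en": ["bank money market stock", "stock price market investor"],
        "es": ["banco dinero mercado bolsa", "bolsa precio mercado inversor"],
    },
}
CONCEPTS = sorted(ONTOLOGY)  # fixed concept order = shared feature space


def bag(text):
    """Lower-cased whitespace-token counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two token-count bags."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_generator(lang):
    """Build one feature generator for `lang` from its support documents."""
    profiles = {c: bag(" ".join(ONTOLOGY[c][lang])) for c in CONCEPTS}

    def generate(text):
        doc = bag(text)
        # One feature per concept; the index means the same in every language.
        return [cosine(doc, profiles[c]) for c in CONCEPTS]

    return generate


GENERATORS = {lang: build_generator(lang) for lang in ("en", "es")}


def train(labeled):
    """Nearest-centroid training on (lang, text, label) triples."""
    sums, counts = {}, Counter()
    for lang, text, label in labeled:
        vec = GENERATORS[lang](text)
        acc = sums.setdefault(label, [0.0] * len(CONCEPTS))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] += 1
    return {lab: [v / counts[lab] for v in acc] for lab, acc in sums.items()}


def classify(centroids, lang, text):
    """Assign the label whose centroid is closest in concept space."""
    vec = GENERATORS[lang](text)
    return min(
        centroids,
        key=lambda lab: sum((x - y) ** 2 for x, y in zip(vec, centroids[lab])),
    )


# Train on English only, then classify Spanish documents.
model = train([
    ("en", "the team scored a goal", "sport"),
    ("en", "the stock market fell", "econ"),
])
print(classify(model, "es", "el equipo gano el partido con un gol"))
print(classify(model, "es", "el banco y la bolsa"))
```

Because the classifier only ever sees concept-indexed vectors, adding a training or test language in this sketch amounts to adding support documents for that language and building one more generator; the classifier itself does not change.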


